Abstract: Clustering is one of the fundamental tasks in unsupervised learning and plays an important role in many application areas. Optimal transport has become a widely used optimization theory in recent decades. It grew out of a practical problem about transporting goods and has since had a major impact on many fields. The Wasserstein distance is one of the most fundamental metrics on spaces of probability measures and offers several significant benefits: (i) it incorporates the geometry of the ground space, (ii) it not only measures the distance between two distributions but also describes how to transport one distribution to the other, and (iii) it is applicable to distributions with different dimensions, and even to pairs consisting of a discrete and a continuous distribution. The Wasserstein distance provides a useful tool for clustering distributions because it captures key shape characteristics of the distributions. The connection between optimization methodology and clustering algorithms not only advances the understanding of the principles and theory behind existing clustering algorithms, but also helps inspire new ideas for efficient clustering algorithms.
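As a small illustration of benefit (i), the following sketch computes the 1-Wasserstein distance between two empirical distributions on the real line. It relies on the standard fact that, in one dimension with equal sample sizes and uniform weights, the optimal transport plan matches sorted samples. The function name `wasserstein_1d` and the sample values are hypothetical and chosen only for illustration; this is not an implementation from the paper.

```python
import numpy as np

def wasserstein_1d(x, y):
    """1-Wasserstein distance between two equal-size empirical samples on the real line.

    Assumption for this sketch: both samples have the same size and uniform
    weights, so the optimal plan pairs the i-th smallest point of x with the
    i-th smallest point of y.
    """
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    if x.shape != y.shape:
        raise ValueError("this sketch assumes equal sample sizes")
    # Average cost of moving each sorted point of x to its partner in y.
    return float(np.mean(np.abs(x - y)))

# Hypothetical samples: b is a shifted by 2, c moves only one point of a.
a = np.array([0.0, 1.0, 2.0])
b = a + 2.0
c = np.array([0.0, 1.0, 5.0])

d_ab = wasserstein_1d(a, b)  # every point travels 2 units
d_ac = wasserstein_1d(a, c)  # one of three points travels 3 units
```

The distance grows with how far mass has to travel in the ground space: shifting every point by 2 gives a distance of 2, while moving a single point out of three by 3 gives a distance of 1.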