Clustering

Last update: 21 Apr 2025 21:17
First version:

A topic in data mining and statistics: given a big bunch of data points, assign them to a discrete set of groups in a way which somehow reflects the natural divisions among them, without knowing in advance what the groups are. This is the unsupervised counterpart to classification. (You see where the connection with induction comes in.) This is an important subject, but one of the topics I most dislike teaching in data mining, because the students' natural question is always "how do I know when my clustering algorithm is giving me a good solution?", and it's very hard to give them a reasonable answer. I think this is because most other data-mining problems are basically predictive, and so one can ask how good the prediction is; what's the best way to turn clustering into a prediction problem? (Probabilistic mixture models suggest themselves, of course.)

David Hand, Heikki Mannila and Padhraic Smyth, Principles of Data Mining [The textbook I teach from; also a book I learned a lot from.]
Isabelle Guyon, Ulrike von Luxburg, and Robert C. Williamson, "Clustering: Science or Art?" [Online PDF]
Jacob Kogan, Introduction to Clustering Large and High-Dimensional Data

Margareta Ackerman and Shai Ben-David, "Measures of Clustering Quality: A Working Set of Axioms for Clustering", pp. 121--128 in NIPS 2008 [A rebuttal to Kleinberg's paper on clustering. Thanks to "arthegall" and Ed Vielmetti for pointers.]
David Arthur and Sergei Vassilvitskii, "k-means++: The Advantages of Careful Seeding" [PDF preprint via Dr. Arthur]
David M. Blei, Andrew Y. Ng and Michael I. Jordan, "Latent Dirichlet Allocation", Journal of Machine Learning Research 3 (2003): 993--1022
Sébastien Bubeck, Ulrike von Luxburg, "Nearest Neighbor Clustering: A Baseline Method for Consistent Clustering with Arbitrary Objective Functions", Journal of Machine Learning Research 10 (2009): 657--698
David S. Choi, Patrick J. Wolfe, "Co-clustering separately exchangeable network data", arxiv:1212.4093
Jon Kleinberg, "An Impossibility Theorem for Clustering", pp. 463--470 in NIPS 2002 [Thanks to Aaron Clauset for pointing me to this. But see Ackerman and Ben-David, above.]
Jeffrey W. Miller and Matthew T. Harrison [Moral: generally, the posterior distribution on the number of clusters in Bayesian clustering does not converge on the true number of clusters]
- "A simple example of Dirichlet process mixture inconsistency for the number of components", pp. 199--206 of Burges et al. (eds), NIPS 2013, arxiv:1301.2708
- "Inconsistency of Pitman-Yor Process Mixtures for the Number of Components", Journal of Machine Learning Research 15 (2014): 3333--3370
Rebecca Nugent and Werner Stuetzle, "Clustering with Confidence: A Low-Dimensional Binning Approach" [PDF]
Jonathan K. Pritchard, Matthew Stephens and Peter Donnelly, "Inference of Population Structure Using Multilocus Genotype Data", Genetics 155 (2000): 945--959 [As "arthegall" pointed out on Pinboard (i, ii), this is the same model as in the famous Blei/Ng/Jordan paper, but applied to genes, rather than words.]
Parasaran Raman, Suresh Venkatasubramanian, "Power to the Points: Validating Data Memberships in Clusterings", arxiv:1305.4757
Matus Telgarsky and Sanjoy Dasgupta, "Moment-based Uniform Deviation Bounds for $k$-means and Friends", arxiv:1311.1903
Greg Ver Steeg, Aram Galstyan, "Discovering Structure in High-Dimensional Data Through Correlation Explanation", arxiv:1406.1222
Greg Ver Steeg, Aram Galstyan, Fei Sha, and Simon DeDeo, "Demystifying Information-Theoretic Clustering", arxiv:1310.4210

My lecture notes for my data mining class [However, many of them are based on lecture notes originally written by Tom Minka, and modesty does not forbid me from recommending his work.]

L. Angelini, L. Nitti, M. Pellicoro and Sebastiano Stramaglia, "Cost functions for pairwise data clustering," cond-mat/0103414
Arash A. Amini and Zahra S. Razaee, "Concentration of kernel matrices with application to kernel spectral clustering", Annals of Statistics 49 (2021): 531--556
Hendrik Blockeel, Luc De Raedt and Jan Ramon, "Top-down induction of clustering trees," cs.LG/0011032
Tamara Broderick, Michael I. Jordan, Jim Pitman, "Clusters and features from combinatorial stochastic processes", arxiv:1206.5862
Joachim M. Buhmann, "Information theoretic model validation for clustering", arxiv:1006.0375
Gunnar Carlsson, Facundo Mémoli, "Characterization, Stability and Convergence of Hierarchical Clustering Methods", Journal of Machine Learning Research 11 (2010): 1425--1470
Miguel Á. Carreira-Perpiñán, Weiran Wang, "The K-modes algorithm for clustering", arxiv:1304.6478
Michael Chichignoud, Sébastien Loustau, "Adaptive Noisy Clustering", arxiv:1306.2194
Ricardo Fraiman, Badih Ghattas, Marcela Svarc, "Clustering using Unsupervised Binary Trees: CUBT", arxiv:1011.2624
Clara Granell, Sergio Gomez, Alex Arenas, "Mesoscopic analysis of networks: applications to exploratory analysis and data clustering", arxiv:1101.1811
Stephen Jose Hanson, Malcolm Bauer, "Machine Learning, Clustering, and Polymorphy", arxiv:1304.3432 [Also UAI 1985]
Hanwen Huang, Yufeng Liu, Ming Yuan, J. S. Marron, "Statistical Significance of Clustering using Soft Thresholding", arxiv:1305.5879
Madjid Khalilian, Norwati Mustapha, "Data Stream Clustering: Challenges and Issues", arxiv:1006.5261
Daniel Korenblum and David Shalloway, "Macrostate Data Clustering", Physical Review E 67 (2003): 056704 [This sounds a lot like spectral clustering and diffusion maps]
Marta Luksza, Michael Lassig and Johannes Berg, "Significance Analysis and Statistical Mechanics: An Application to Clustering", Physical Review Letters 105 (2010): 220601
Marina Meila, Machine Learning 86 (2012): 369--389
Volodymyr Melnykov and Ranjan Maitra, "Finite mixture models and model-based clustering", Statistics Surveys 4 (2010): 80--116
Vladimir Nikulin and Geoffrey J. McLachlan, "Strong Consistency of Prototype Based Clustering in Probabilistic Space", arxiv:1004.3101 [The regularization they're using isn't at all obvious on first scan of the paper]
Alessandro Rinaldo, Aarti Singh, Rebecca Nugent, Larry Wasserman, "Stability of Density-Based Clustering", arxiv:1011.2771
Alessandro Rinaldo and Larry Wasserman, "Generalized Density Clustering", Annals of Statistics 38 (2010): 2678--2722, arxiv:0907.3454
Oha Shamir and Naftali Tishby, "Stability and model selection in k-means clustering", Machine Learning 80 (2010): 213--243
Qing Song, "A Robust Information Clustering Algorithm", Neural Computation 17 (2005): 2672--2698 ["We focus on the scenario of robust information clustering (RIC) based on the minimax optimization of mutual information (MI). The minimization of MI leads to the standard mass-constrained deterministic annealing clustering, which is an empirical risk-minimization algorithm. The maximization of MI works out an upper bound of the empirical risk via the identification of outliers (noisy data points). Furthermore, we estimate the real risk VC-bound and determine an optimal cluster number of the RIC based on the structural risk-minimization principle. One of the main advantages of the minimax optimization of MI is that it is a nonparametric approach, which identifies the outliers through the robust density estimate and forms a simple data clustering algorithm based on the square error of the Euclidean distance."]
Susanne Still and William Bialek, "How many clusters? An information theoretic perspective," Neural Computation 16 (2004): 2483--2506, physics/0303011
Wei Sun, Junhui Wang, and Yixin Fang, "Regularized k-means clustering of high-dimensional data and its asymptotic consistency", Electronic Journal of Statistics 6 (2012): 148--167
Nguyen Xuan Vinh, Julien Epps, James Bailey, "Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance", Journal of Machine Learning Research 11 (2010): 2837--2854
Ulrike von Luxburg, "A Tutorial on Spectral Clustering", arxiv::0711.0189
Ulrike von Luxburg, Mikhail Belkin, Olivier Bousquet, "Consistency of spectral clustering", arxiv:0804.0678
Junhui Wang, "Consistent selection of the number of clusters via crossvalidation", Biometrika 97 (2010): 893--904
Zhirong Yang and Erkki Oja, "Clustering by Low-Rank Doubly Stochastic Matrix Decomposition", arxiv:1206.4676