## Clustering

*27 Feb 2017 16:30*

A topic in data mining and
statistics: given a big bunch of data points,
assign them to a discrete set of groups in a way which somehow reflects the
natural divisions among them, without knowing in advance what the groups are.
This is the unsupervised counterpart to classification. (You see where the
connection with induction comes
in.) This is an important subject, but one of the topics I most dislike
teaching in data mining, because the students' natural question is always "how
do I know when my clustering algorithm is giving me a *good* solution?",
and it's very hard to give them a reasonable answer. I think this is because
most other data-mining problems are basically *predictive*, and so one
can ask how good the prediction is; what's the best way to turn clustering into
a prediction problem? (Probabilistic mixture models suggest themselves, of
course.)

See also: Mixture Models; Classifiers and Clustering for Time Series

- Recommended, big picture:
- David Hand, Heikki Mannila and Padhraic Smyth, Principles of Data Mining [The textbook I teach from; also a book I learned a lot from.]
- Isabelle Guyon, Ulrike von Luxburg, and Robert C. Williamson, "Clustering: Science or Art?" [Online PDF]
- Jacob Kogan, Introduction to Clustering Large and High-Dimensional Data

- Recommended, close-ups:
- Margaret Ackerman and Shai Ben-David, "Measures of Clustering Quality: A Working Set of Axioms for Clustering", NIPS 2008 [PDF. A rebuttal to Kleinberg's paper on clustering. Thanks to "arthegall" and Ed Vielmetti for pointers.]
- David Arthur and Sergei Vassilvitskii, "
`k-means++`: The Advantages of Careful Seeding" [PDF preprint via Dr. Arthur] - David M. Blei, Andrew Y. Ng and Michael I. Jordan, "Latent Dirichlet Allocation", Journal of Machine Learning Research
**3**(2003): 993--1022 - Sébastien Bubeck, Ulrike von Luxburg, "Nearest Neighbor
Clustering: A Baseline Method for Consistent Clustering with Arbitrary
Objective
Functions", Journal of
Machine Learning Research
**10**(2009): 657--698 - David S. Choi, Patrick J. Wolfe, "Co-clustering separately exchangeable network data", arxiv:1212.4093
- Jon Kleinberg, "An Impossibility Theorem for Clustering", Advances in Neural Information Processing Systems 15 (NIPS 2002) [Thanks to Aaron Clauset for pointing me to this. PDF from Kleinberg and from NIPS. But see Ackerman and Ben-David, above.]
- Rebecca Nugent and Werner Stuetzle, "Clustering with Confidence: A Low-Dimensional Binning Approach" [PDF]
- Jonathan K. Pritchard, Matthew Stephens and Peter Donnelly,
"Inference of Population Structure Using Multilocus Genotype
Data", Genetics
**155**(2000): 945--959 [As "arthegall" pointed out on Pinboard (i, ii), this is the same model as in the famous Blei/Ng/Jordan paper, but applied to genes, rather than words.] - Parasaran Raman, Suresh Venkatasubramanian, "Power to the Points: Validating Data Memberships in Clusterings", arxiv:1305.4757
- Matus Telgarsky and Sanjoy Dasgupta, "Moment-based Uniform Deviation Bounds for $k$-means and Friends", arxiv:1311.1903
- Greg Ver Steeg, Aram Galstyan, "Discovering Structure in High-Dimensional Data Through Correlation Explanation", arxiv:1406.1222
- Greg Ver Steeg, Aram Galstyan, Fei Sha, and Simon DeDeo, "Demystifying Information-Theoretic Clustering", arxiv:1310.4210

- Modesty forbids me to recommend:
- My lecture notes for my data mining class [However, many of them are based on lecture notes originally written by Tom Minka, and modesty does not forbid me from recommending his work.]

- To read:
- L. Angelini, L. Nitti, M. Pellicoro and Sebastiano Stramaglia, "Cost functions for pairwise data clustering," cond-mat/0103414
- Hendrik Blockeel, Luc De Raedt and Jan Ramon, "Top-down induction of clustering trees," cs.LG/0011032
- Tamara Broderick, Michael I. Jordan, Jim Pitman, "Clusters and features from combinatorial stochastic processes", arxiv:1206.5862
- Joachim M. Buhmann, "Information theoretic model validation for clustering", arxiv:1006.0375
- Gunnar Carlsson, Facundo Mémoli, "Characterization,
Stability and Convergence of Hierarchical Clustering
Methods", Journal
of Machine Learning Research
**11**(2010): 1425--1470 - Miguel Á. Carreira-Perpiñán, Weiran Wang, "The K-modes algorithm for clustering", arxiv:1304.6478
- Michael Chichignoud, Sébastien Loustau, "Adaptive Noisy Clustering", arxiv:1306.2194
- Ricardo Fraiman, Badih Ghattas, Marcela Svarc, "Clustering using Unsupervised Binary Trees: CUBT", arxiv:1011.2624
- Clara Granell, Sergio Gomez, Alex Arenas, "Mesoscopic analysis of networks: applications to exploratory analysis and data clustering", arxiv:1101.1811
- Stephen Jose Hanson, Malcolm Bauer, "Machine Learning, Clustering, and Polymorphy", arxiv:1304.3432 [Also UAI 1985]
- Hanwen Huang, Yufeng Liu, Ming Yuan, J. S. Marron, "Statistical Significance of Clustering using Soft Thresholding", arxiv:1305.5879
- Madjid Khalilian, Norwati Mustapha, "Data Stream Clustering: Challenges and Issues", arxiv:1006.5261
- Daniel Korenblum and David Shalloway, "Macrostate Data Clustering",
Physical Review E
**67**(2003): 056704 [This sounds a lot like spectral clustering and diffusion maps] - Marta Luksza, Michael Lassig and Johannes Berg, "Significance Analysis and Statistical Mechanics: An Application to Clustering", Physical Review Letters
**105**(2010): 220601 - Marina
Meila, Machine
Learning
**86**(2012): 369--389 - Volodymyr Melnykov and Ranjan Maitra, "Finite mixture models and model-based clustering", Statistics Surveys
**4**(2010): 80--116 - Vladimir Nikulin and Geoffrey J. McLachlan, "Strong Consistency of Prototype Based Clustering in Probabilistic Space", arxiv:1004.3101 [The regularization they're using isn't at all obvious on first scan of the paper]
- Alessandro Rinaldo, Aarti Singh, Rebecca Nugent, Larry Wasserman, "Stability of Density-Based Clustering", arxiv:1011.2771
- Alessandro Rinaldo and Larry Wasserman, "Generalized Density
Clustering", Annals of Statistics
**38**(2010): 2678--2722, arxiv:0907.3454 - Oha Shamir and Naftali Tishby, "Stability and model selection in k-means clustering", Machine Learning
**80**(2010): 213--243 - Qing Song, "A Robust Information Clustering Algorithm",
Neural
Computation
**17**(2005): 2672--2698 ["We focus on the scenario of robust information clustering (RIC) based on the minimax optimization of mutual information (MI). The minimization of MI leads to the standard mass-constrained deterministic annealing clustering, which is an empirical risk-minimization algorithm. The maximization of MI works out an upper bound of the empirical risk via the identification of outliers (noisy data points). Furthermore, we estimate the real risk VC-bound and determine an optimal cluster number of the RIC based on the structural risk-minimization principle. One of the main advantages of the minimax optimization of MI is that it is a nonparametric approach, which identifies the outliers through the robust density estimate and forms a simple data clustering algorithm based on the square error of the Euclidean distance."] - Susanne Still and William Bialek, "How many clusters? An
information theoretic
perspective," Neural
Computation
**16**(2004): 2483--2506, physics/0303011 - Wei Sun, Junhui Wang, and Yixin Fang, "Regularized k-means
clustering of high-dimensional data and its asymptotic
consistency", Electronic
Journal of Statistics
**6**(2012): 148--167 - Nguyen Xuan Vinh, Julien Epps, James Bailey, "Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance", Journal of Machine Learning Research
**11**(2010): 2837--2854 - Ulrike von Luxburg, "A Tutorial on Spectral Clustering", arxiv::0711.0189
- Ulrike von Luxburg, Mikhail Belkin, Olivier Bousquet, "Consistency of spectral clustering", arxiv:0804.0678
- Junhui Wang, "Consistent selection of the number of clusters via crossvalidation", Biometrika
**97**(2010): 893--904 - Zhirong Yang and Erkki Oja, "Clustering by Low-Rank Doubly Stochastic Matrix Decomposition", arxiv:1206.4676