22 Oct 2014 19:37

A topic in data mining and statistics: given a big bunch of data points, assign them to a discrete set of groups in a way which somehow reflects the natural divisions among them, without knowing in advance what the groups are. This is the unsupervised counterpart to classification. (You see where the connection with induction comes in.) This is an important subject, but one of the topics I most dislike teaching in data mining, because the students' natural question is always "how do I know when my clustering algorithm is giving me a good solution?", and it's very hard to give them a reasonable answer. I think this is because most other data-mining problems are basically predictive, and so one can ask how good the prediction is. So: what's the best way to turn clustering into a prediction problem? (Probabilistic mixture models suggest themselves, of course.)
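
To make the mixture-model suggestion concrete, here is a minimal sketch of one way it could go, not a recommended procedure: treat each candidate clustering as a probability model, fit it on part of the data, and score it by the average log-likelihood it assigns to held-out points. Everything specific here is an assumption made for illustration: the data are simulated in one dimension from two Gaussian groups, the mixture is fit by a bare-bones EM loop, and the names `fit_mixture` and `held_out_score` are hypothetical helpers rather than anybody's library.

```python
import math
import random

def log_normal_pdf(x, mean, var):
    """Log density of a normal distribution with the given mean and variance."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))

def component_logs(x, weights, means, variances):
    """Log of (mixing weight * component density) at x, one entry per cluster."""
    return [math.log(w) + log_normal_pdf(x, m, v)
            for w, m, v in zip(weights, means, variances)]

def fit_mixture(data, k, iterations=50):
    """Fit a 1-D Gaussian mixture with k components by EM (illustrative only)."""
    n = len(data)
    ordered = sorted(data)
    # Assumed initialisation: means at evenly spaced sample quantiles.
    means = [ordered[int((j + 0.5) * n / k)] for j in range(k)]
    centre = sum(data) / n
    spread = sum((x - centre) ** 2 for x in data) / n
    variances = [spread] * k
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # E-step: posterior probability of each cluster for each point.
        resp = []
        for x in data:
            logs = component_logs(x, weights, means, variances)
            total = log_sum_exp(logs)
            resp.append([math.exp(v - total) for v in logs])
        # M-step: re-estimate weights, means and variances from those posteriors.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)  # guard against empty clusters
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var_j = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var_j, 1e-6)  # floor keeps a cluster from collapsing
    return weights, means, variances

def held_out_score(data, model):
    """Average log-likelihood per point: the 'prediction quality' of a clustering."""
    weights, means, variances = model
    return sum(log_sum_exp(component_logs(x, weights, means, variances))
               for x in data) / len(data)

def sample(n_per_group, rng):
    """Simulated data: two well-separated groups (an assumption of this sketch)."""
    return ([rng.gauss(-3.0, 1.0) for _ in range(n_per_group)]
            + [rng.gauss(3.0, 1.0) for _ in range(n_per_group)])

rng = random.Random(0)
train = sample(200, rng)
test = sample(200, rng)

models = {k: fit_mixture(train, k) for k in (1, 2, 3)}
scores = {k: held_out_score(test, models[k]) for k in models}
for k in sorted(scores):
    print(k, round(scores[k], 3))
```

The point of scoring on `test` rather than `train` is that in-sample likelihood can only be expected to improve as clusters are added, whereas the held-out score is an honest measure of how well the fitted clusters predict new points. On data like these the two-cluster model should beat the one-cluster model clearly; whether a third cluster helps or hurts is exactly the kind of question the held-out score is meant to answer.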

See also: Mixture Models; Classifiers and Clustering for Time Series