### Data Mining (36-350) Lecture Notes, Weeks 4--7

These handouts are ~~shamelessly ripped off~~ derivative work, amplifying and
expanding those created by Tom Minka when he invented this course. (See his
originals here.) Posted here in response to a number (> 1) of requests. See
here for the first three weeks' handouts.

Note to students in 36-350: This page will *not* keep up to date with
the handouts, or with other course documents; use Blackboard!

- September 20 and 25 (Lecture 6): Partitioning Data
into Clusters. Supervised and unsupervised learning. Social and
organizational aspects of categorization. Finding categories in data via
clustering. Characteristics of good clusters. The k-means algorithm for
clustering. Search algorithms, search landscapes, hill climbing, local minima.
Algorithms for hierarchical clustering. Avoiding spherical clusters. See
also: slides to accompany the
second half, showing clustering of images.
- September 27 (Lecture 7): Making Better
Features. Transforming features to enhance invariance. Transforming
features to improve their distribution. Projecting high-dimensional data into
lower dimensions. Principal component analysis: informal description and
example.
- October 2 (Lecture 8): More on Principal Component
Analysis. Mathematical basis: maximizing the variance of the projected
points. Mathematical basis: minimizing reconstruction error. Interpretation
of PCA results.
- October 4: Review of course to date. (No handout.)
- October 9 (Lecture 9): Evaluating Predictive
Models. Classification and linear regression as examples of predictive
modeling. Error measures a.k.a. loss functions; examples. In-sample error.
Out-of-sample or generalization error; why it matters, relation to in-sample
error. Model selection. An example of over-fitting. Approaches to limiting
over-fitting and its ill effects.
- October 11 (Lecture 10): Regression Trees.
Difficulties of fitting global models in complex systems. Recursive
partitioning and simple local models as a solution. Prediction trees in
general. Regression trees in particular. An example. Tree growing. Tree
pruning via cross-validation.
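
For readers who want something concrete to go with Lecture 6, here is a minimal sketch of k-means as an alternating assign-and-update procedure. It is not taken from the handouts: the function names (`sq_dist`, `kmeans`), the toy data, and the choice of starting centers are all assumptions made for illustration. Because each pass can only lower the within-cluster sum of squares, this is a form of hill climbing, and the answer can depend on where the centers start.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centers, max_iters=100):
    """Illustrative k-means: alternate nearest-center assignment and mean updates.

    `points` is a list of tuples; `centers` is the list of initial centers
    (hypothetical starting values chosen by the caller).
    """
    centers = list(centers)
    k = len(centers)
    assignment = None
    for _ in range(max_iters):
        # Assignment step: each point joins the cluster of its nearest center.
        new_assignment = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points
        ]
        if new_assignment == assignment:
            break  # nothing moved, so we are at a (possibly local) minimum
        assignment = new_assignment
        # Update step: move each center to the mean of its current members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # leave an empty cluster's center where it was
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, assignment


# Toy data: two well-separated groups in the plane (made up for this sketch).
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
          (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
final_centers, labels = kmeans(points, centers=[(0.0, 0.0), (5.0, 5.0)])
print(labels)
print(final_centers)
```

Starting the centers elsewhere, or asking for more clusters than the data really has, is an easy way to see the local-minimum behavior the lecture describes.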

Corrupting the Young; Enigmas of Chance

Posted at October 12, 2006 11:40