Statistics 36-350: Data Mining (Fall 2008)
Since class begins Monday, this is a good time for the public website to make its appearance.
As before, lecture notes will also be posted here; you can use the RSS feed for this entry to keep track of them.
- Introduction to the course (25 August)
- Information retrieval and similarity searching (25 August)
- Multidimensional scaling and a first glance at classification (27 August)
- A little about page-rank (29 August)
Homework #1, due 8 September: assignment, R code, newsgroups.tgz data file, solutions
- Image search, abstraction and invariance; the accompanying slides (8 September)
- Finding informative features (10 September)
Additional reading: David Feldman, "Introduction to Information Theory", chapter 1
- Information and interaction among features (12 September)
Additional reading: Aleks Jakulin and Ivan Bratko, "Quantifying and
Visualizing Attribute Interactions", arxiv:cs.AI/0308002
Homework #2, due 22 September: assignment, solutions, solutions code
Note: Information theory, axiomatic foundations, connections to statistics; an elaboration on some points raised in lecture (12 September)
- Categorization: types of categorization, basic classifiers and finding simple clusters in data (15 September)
- Hierarchical clustering; how many clusters? (17 September)
- Yet more clustering (19 September; slides)
- Making better features: transformations, principal components (22 September)
- Mathematics of principal components analysis; interpretations and limitations of PCA (24 September)
- Yet more on linear dimensionality reduction: PCA + information retrieval = Latent semantic indexing. Factor analysis: motivations, historical roots, preliminaries to estimation (26 September)
Optional reading: Deerwester et al., "Indexing by Latent Semantic Analysis" [PDF]
Optional reading: Landauer and Dumais, "A Solution to Plato's Problem: The Latent Semantic
Analysis Theory of Acquisition, Induction, and Representation of Knowledge"
[PDF]
Optional reading: Thurstone, "The Vectors of Mind"
Homework #3, due 3 October: assignment
- More on factor analysis: estimation and the rotation
problem (29 September)
- Principal Components versus Factor Analysis: worked examples, basic goodness-of-fit testing for factor analysis; R code for lecture (1 October)
- The truth about principal components
and factor analysis: strengths, limitations, factor models as graphical
models, factor models and mixture models, Thomson's sampling model; R code for Thomson's model (3 October)
Homework #4, due Friday, 10 October: assignment, nci.kmeans, nci.pca2.kmeans
- Regression: predicting quantitative features: point prediction; expectations and mean-square optimality; regression functions; regression as smoothing; linear regression as linear smoothing; other kinds of linear smoothers; nearest-neighbor regression; kernel regression. R code for figures, data for running example (6 October)
- The truth about linear regression: optimal linear prediction; shifting distributions and omitted variables; rights and obligations of probabilistic assumptions; abuses of linear regression; how to hurt angels (8 October)
- Extending linear regression: weighted least-squares, heteroskedasticity, local linear regression. R code for figures, data for running example (10 October)
- Mid-term review (13 October; no hand-out)
- Mid-term: exam, solutions (15 October)
- Evaluating predictive models: in-sample and generalization error; over-fitting and under-fitting; model selection, capacity control, cross-validation. R code for figures. (20 October)
- Using cross-validation: mechanics and examples (22 October; notes forthcoming)
- Using non-parametric smoothing: adaptive smoothing, testing parametric
forms (24 October; notes forthcoming)
Homework #5, due Friday, 31 October: assignment; solutions
- Prediction trees 1: mostly regression trees, plus a "classification tree we can believe in" (27 October)
- Prediction trees 2: classification trees (29 October and 3 November)
- Bootstrapping, Bagging, and Random Forests (5 November)
- Combining Predictive Models and the Power of Diversity (7 November)
- Linear Classifiers and the Perceptron Algorithm (10 November)
- Logistic Regression and Newton's Method (12 November)
Homework #7, due Friday, 21 November: assignment; solutions
- Neural Networks: The Mathematical Reality (14 November)
- Neural Networks: The Biological Myth (17 November)
- Support Vector Machines (19 November)
- Support vector machines continued (21 November; same handout as previous)
Homework #8, due Monday, 1 December: assignment; solutions
- The Lecture Full of Fail: The wrong data, lying data, covariate shift,
low base-rates and overwhelming false positives, response
Waste, fraud and abuse (24 November)
Homework #9, due 15 December: assignment; solutions
Corrupting the Young; Enigmas of Chance
Posted at December 28, 2008 10:49