36-350, Data Mining: Course Materials (Fall 2009)
My lesson-plan having survived first contact with
the enemy students, it's time to start posting the lecture
handouts & c.  This page will be updated as the semester goes on; the RSS
feed for it should be here.
The class homepage has more
information.
-  Introduction
to the course (24 August) What is data mining? how is it used? where did it
come from? Some themes.
-  Information
retrieval and similarity searching I (26 August) Finding the data you are
looking for.  Ideas we will avoid: meta-data and cataloging; meanings.  Textual
features.  The bag-of-words representation; its vector form.  Measuring
similarity and distance for vectors.  Example with the New York Times
Annotated Corpus.
-  IR continued (28 August).  The
trick to searching: queries are documents.  Search evaluation: precision,
recall, precision-recall curves; error rates.  Classification: nearest
neighbors and prototypes; classifier evaluation by mis-classification rate and
by confusion matrices.  Inverse document frequency weighting.  Visualizing
high-dimensional data by multi-dimensional scaling.  Miscellaneous topics:
stemming, incorporating user feedback.
Homework 1, due 4 September: assignment,
R, data; solutions
 
-  Page
Rank (31 August).  Links as pre-existing feedback.  How to exploit link
information?  The random walk on the graph; using the ergodic theorem.
Eigenvector formulation of page-rank.  Combining page-rank with textual
features.  Other applications.  Further reading on information retrieval.
-  Image
Search, Abstraction and Invariance (2 September).  Similarity search for
images.  Back to representation design.  The advantages of abstraction:
simplification, recycling.  The bag-of-colors representation.  Examples.
Invariants.  Searching for images by searching text.  An example in practice.
Slides for this lecture.
-  Information
Theory I (4 September).  Good features help us guess what we can't
represent.  Good features discriminate between different values of unobserved
variables.  Quantifying uncertainty with entropy.  Quantifying reduction in
uncertainty/ discrimination with mutual information.  Ranking features based on
mutual information.  Examples, with code, of informative words for
the Times.  Code.
 Supplementary reading: David
P. Feldman, Brief Tutorial on
Information Theory, chapter 1Homework 2, due 11 September: assignment; solutions
text
 and R code
 
-  Information Theory II (9
September).  Dealing with multiple features.  Joint entropy, the chain rule for
entropy.  Information in multiple features.  Conditional information, chain
rule for information, conditional independence.  Interactions, positive and
negative, and redundancy.  Greedy feature selection with low redundancy.
Example, with code, of selecting words for the Times.  Sufficient
statistics and the information
bottleneck. Code.
 Supplementary reading; Aleks Jakulin and Ivan Bratko, "Quantifying and
Visualizing Attribute
Interactions", arxiv:cs.AI/0308002
-  Categorization;
Clustering I (11 September).  Dividing the world up into categories.
Classification: known categories with labeled examples.  Taxonomy of learning
problems (supervised, unsupervised, semi-supervised, feedback, ...).
Clustering: discovering unknown categories from unlabeled data.  Benefits of
clustering, with an digression on where official classes come from.  Basic
criterion for good clusters: lots of information about features from little
information about cluster.  Practical considerations: compactness, separation,
parsimony, balance.  Doubts about parsimony and balance.  The k-means
clustering algorithm, or unlabeled prototype classification: analysis,
geometry, search.  Appendix: geometric aspects of the prototype and
nearest-neighbor method.
Homework 3, due 18
September: assignment; solutions
 
-  Clustering II (14 September).
Distances between partitions; variation-of-information distance.
Hierarchical clustering by agglomeration and its varieties.  Picking the
number of clusters by merging costs.  Performance of different clustering
methods on various doodles.  Why we would like to pick the number of clusters
by predictive performance, and why it is hard to do at this stage.  Reifying clusters.
-  Transformations: Rescaling and
Low-Dimensional Summaries (16 September).  Improving on our original
features.  Re-scaling, standardization, taking logs, etc., of individual
features.  Forcing things to be Gaussian considered harmful.  Low-dimensional
summaries by combining features.  Exploiting geometry to eliminate redundancy.
Projections on to linear subspaces.  Searching for structure-preserving
projections.
-  Principal Components I (18
September).  Principal components are the directions of maximum variance.
Derivation of principal components as the best approximation to the data in a
linear subspace.  Equivalence to variance maximization.  Avoiding explicit
optimization by finding eigenvalues and eigenvectors of the covariance matrix.
Example of principal components with cars; how to tell a sports car from a
minivan.  The standard recipe for doing PCA.  Cautions in interpreting
PCA.  Data-set used in the notes.
Homework 4, due 25
September: assignment; solutions
 
-  Principal
Components II (21 September).  PCA + information retrieval = latent
semantic indexing; why LSI is a Good Idea.  PCA and multidimensional scaling.
-  Factor
Analysis (23 and 25 September).  From PCA to factor analysis by adding
noise.  Roots of factor analysis in causal discovery: Spearman's general factor
model and the tetrad equations.  Problems with estimating factor models: number
of equations does not equal number of unknowns.  Solution 1, "principal
factors", a.k.a. estimation through heroic feats of linear algebra.  Solution
2, maximum likelihood, a.k.a. estimation through imposing distributional
assumptions.  The rotation problem: the factor model is
unidentifiable; the number of factors may be meaningful, but the individual
factors are not.
-  The
Truth about PCA and Factor Analysis (28 September) PCA is data reduction
without any probabilistic assumptions about where the data came from.  Picking
number of components.  Faking predictions from PCA.  Factor analysis makes
stronger, probabilistic assumptions, and delivers stronger, predictive
conclusions --- which could be wrong.  Using probabilistic assumptions and/or
predictions to pick how many factors.  Factor analysis as a first, toy
instances of a graphical causal model.  The rotation problem once more with
feeling.  Factor models and mixture models.  Factor models and Thomson's
sampling model: an outstanding fit to a model with a few factors is actually
evidence of a huge number of badly measured latent variables.
Final advice: it all depends, but if you can only do one, try PCA.
R
code for the Thomson sampling model.
-  Nonlinear
Dimensionality Reduction I: Locally Linear Embedding (5 October).  Failure
of PCA and all other linear methods for nonlinear structures in data; spirals,
for example.  Approximate success of linear methods on small parts of nonlinear
structures.  Manifolds: smoothly curved surfaces embedded in higher-dimensional
Euclidean spaces.  Every manifold looks like a linear subspace on a
sufficiently small scale, so we should be able to patch together many small
local linear approximations into a global manifold.  Local linear embedding:
approximate each vector in the data as a weighted linear combination of
its k nearest neighbors, then find the low-dimensional vectors best
reconstructed by these weights.  Solving the optimization problems by linear
algebra.  Coding up LLE.  A spiral
rainbow.  R.
- Nonlinear
Dimensionality Reduction II: Diffusion Maps (9 October).  Making a graph
from the data; random walks on this graph.  The diffusion operator,
a.k.a. Laplacian.  How the Laplacian encodes the shape of the data.
Eigenvectors of the Laplacian as coordinates.  Connection to page-rank.
Advantages when data are not actually on a manifold.  Example.
Pre-midterm review (12 October): highlights of the course to date; no
handout.
 MIDTERM (14
October): exam, solutions
 Homework 5, due 23 October:
assignment;
solutions
 
- Regression
I: Basics.  Guessing a real-valued random variable; why expectation values
are mean-square optimal point forecasts.  The regression function; why its
estimation must involve assumptions beyond the data.  The bias-variance
decomposition and the bias-variance trade-off.  First example of improving
prediction by introducing variance.  Ordinary least squares linear regression
as smoothing.  Other linear smoothers: k-nearest-neighbors and kernel
regression.  How much should we
smooth?  R, data
for running example
-  Regression
II: The Truth About Linear Regression (21 October).  Linear regression is
optimal linear (mean-square) prediction; we do this because we hope a linear
approximation will work well enough over a small range.  What linear regression
does: decorrelate the input features, then correlate them separately with the
response and add up.  The extreme weakness of the probabilistic assumptions
needed for this to make sense.  Difficulties of linear regression;
collinearity, errors in variables, shifting distributions of inputs, omitted
variables.  The usual extra probabilistic assumptions and their implications.
Why you should always looking at residuals.  Why you generally shouldn't use
regression for causal inference.  How to torment angels.  Likelihood-ratio
tests for restrictions of nice models.
-  Regression III: Extending Linear
Regression (23 October).  Weighted least squares.  Heteroskedasticity:
variance is not the same everywhere.  Going to consult the oracle.  Weighted
least squares as a solution to heteroskedasticity.  Nonparametric estimation of
the variance function.  Local polynomial regression: local constants (= kernel
regression), local linear regression, higher-order local polynomials.  Lowess =
locally-linear smoothing for scatter plots.  The oracles fall silent.
Homework 6, due Friday, 30 October: assignment, data set; solutions
 
-  Evaluating Predictive Models (26
and 28 October).  In-sample, out-of-sample and generalization loss or error;
risk as expected loss on new data.  Under-fitting, over-fitting, and examples
with polynomials.  Methods of model selection and controlling over-fitting:
empirical risk minimization, penalization, constraints/sieves, formal learning
theory, cross-validation.  Limits of
generalization.  R for creating figures.
-  Smoothing
Methods in Regression (30 October).  How much smoothing should we do?
Approximation by local averaging.  How much smoothing we should do to
find the unknown curve depends on how smooth the curve really is,
which is unknown.  Adaptation as a partial substitute for actual knowledge.
Cross-validation for adapting to unknown smoothness.  Application: testing
parametric regression models by comparing them to nonparametric fits.  The
bootstrap principle.  Why ever bother with parametric
regressions?  R
code for some of the examples.
Homework 7, due Friday, 6
November: assignment;
solutions: text
and code
 
-  Additive
Models (2 November).  A nice feature of linear models: partial responses,
partial residuals, and backfitting estimations.  Additive models: regression
curve is a sum of partial response functions; partial residuals and the
backfitting trick generalize.  Parametric and non-parametric rates of
convergence.  The curse of dimensionality for unstructured nonparametric
models.  Additive models as a compromise, introducing bias to reduce variance.
Example with the data from homework 6.
-  Classification
and Regression Trees (4 and 6 November).  Prediction trees.  A
classification tree we can believe in.  Prediction trees combine simple local
models with recursive partitioning; adaptive nearest neighbors.  Regression
trees: example; a little math; pruning by cross-validation; more R mechanics.
Classification trees: basics; measuring error by mis-classification; weighted
errors; likelihood; Neyman-Pearson classifiers.  Uncertainty for trees.
Homework 8, due 5 pm on Monday, 16 November: assignment; solutions; R for solutions
 
-  Combining Models 1: Bagging and Model Averaging (9 November)
-  Combining Models 2: Diversity and Boosting (11 November)
-  Linear Classifiers (16 November).
Geometry of linear classifiers.  The perceptron algorithm for learning linear
classifiers.  The idea of "margin".
-  Logistic Regression (18
November).  Attaching probabilities to linear classifiers: why would we want
to?  why would we use the logistic transform to do so?  More-than-binary
logistic regression. Maximizing the likelihood; Newton's method for
optimization.  Generalized linear models and generalized additive models;
testing GLM specifications with GAMs.
-  Support Vector Machines (20
November).  Turning nonlinear problems into linear ones by expanding into
high-dimensional feature spaces.  The dual representation of linear
classifiers: weight training points, not features.  Observation: in the dual
representation, only inner products of vectors matter.  The kernel trick:
kernel functions let us compute inner products in feature spaces without
computing the features.  Some bounds on the generalization error of linear
classifiers based on "margin" and the number of training points with non-zero
weight ("support vectors").  Learning support vector machines by trading
in-sample performance against bounds on over-fitting.
Homework 9, due at 5 pm on Monday, 30 November: assignment
 
-  Density Estimation (23 November).
Histograms as distribution estimates.  Glivenko-Cantelli, "the fundamental
theorem of statistics".  Histograms as density estimates; selecting density
estimates by cross-validation.  Kernel density estimates.  Why kernels are
better than histograms.  Curse of dimensionality again.  Hint at alternatives
to kernel density estimates.
-  Mixture Models, Latent Variables and
the EM Algorithm (30 November).  Compressing and restricting density
estimates.  Mixtures of limited numbers of distributions.  Mixture models as
probabilistic clustering; finally an answer to "how many clusters?"  The EM
algorithm as an iterative way of maximizing likelihood with latent variables.
Analogy to k-means.  More theory of the EM algorithm.  Applications: density
mixtures, signal processing/state estimation, mixtures of regressions, mixtures
of experts; topic models and probabilistic latent semantic analysis.  A glance
at non-parametric mixture models.
-  Graphical Causal Models (2 December).  Distinction between causation and
association, and between causal and probabilistic prediction.  Some examples.
Directed acyclic graphs and causal models.  The Markov property.  Conditional
independence via separation.  Faithfulness.
-  Causal Inference (4 December).
Estimating causal effects; control for confounding.  Discovering causal
structure: the SGS algorithm and its variants.  Limitations.
Take-home final exam, due 15
December: assignment; data
sets: expressdb_cleaned
(20
Mb), HuIyer_TFKO_expression
(20 Mb).  With great thanks to Dr. Timothy Danford.
 
Corrupting the Young;
Enigmas of Chance
 
Posted at December 05, 2009 14:39 | permanent link