Data Mining (36-350) Lecture Notes, Weeks 1--3
These handouts are a derivative work (read: shamelessly ripped off),
amplifying and expanding those created by Tom Minka when he
invented this course.  (See his originals here.)  Posted
here in response to a number (> 1) of requests.
Lecture 5 is also an explication (read: shameless rip-off)
of Aleks Jakulin's "Quantifying and Visualizing Attribute Interactions"
(cs.AI/0308002).
Note to students in 36-350: This page will not keep up to date with
the handouts, or with other course documents; use Blackboard!
- Searching Documents by Similarity (28 August 2006). Why similarity search?
Defining similarity and distance.  The bag-of-words representation.
Normalizations.  Some results.
- More on Similarity Search (30 August 2006). Stemming, linguistic issues.
Picking out good features, or at least ignoring non-discriminative ones.
Inverse document frequency.  Using feedback from the searcher.
- Searching Images by Similarity (6 September 2006). Representation and
abstraction. How to search images without looking at images; a failure-mode.
The bag-of-colors representation.  More examples.  Invariance and
representation.  See also: slides illustrating this lecture.
- Finding Informative Features (11--13 September 2006). More on finding good
features.  Entropy and uncertainty.  Information and entropy.  Ranking
features by informativeness.  Examples.
- Interactions Among Features (18 September 2006). Redundancy and enhancement
of information.  Information-sharing graphs.  Examples.
Corrupting the Young; Enigmas of Chance
 
Posted on September 16, 2006 at 12:56