September 16, 2006

Data Mining (36-350) Lecture Notes, Weeks 1--3

These handouts are a shamelessly ripped-off derivative work, amplifying and expanding those created by Tom Minka when he invented this course. (See his originals here.) Posted here in response to a number (> 1) of requests.

Lecture 5 is also a shameless rip-off, or rather an explication, of Aleks Jakulin's "Quantifying and Visualizing Attribute Interactions" (cs.AI/0308002).

Note to students in 36-350: This page will not keep up to date with the handouts, or with other course documents; use Blackboard!

  1. Searching Documents by Similarity (28 August 2006). Why similarity search? Defining similarity and distance. The bag-of-words representation. Normalizations. Some results.
  2. More on Similarity Search (30 August 2006). Stemming, linguistic issues. Picking out good features, or at least ignoring non-discriminative ones. Inverse document frequency. Using feedback from the searcher.
  3. Searching Images by Similarity (6 September 2006). Representation and abstraction. How to search images without looking at images; a failure-mode. The bag-of-colors representation. More examples. Invariance and representation. See also: slides illustrating this lecture.
  4. Finding Informative Features (11--13 September 2006). More on finding good features. Entropy and uncertainty. Information and entropy. Ranking features by informativeness. Examples.
  5. Interactions Among Features (18 September 2006). Redundancy and enhancement of information. Information-sharing graphs. Examples.
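
For readers who want a concrete feel for the first two lectures before opening the handouts, here is a minimal sketch of bag-of-words similarity search with inverse document frequency weighting. It is not taken from the handouts: the whitespace tokenizer, the natural-log form of IDF, the function names, and the three toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lower-case the text, split on whitespace, and count each word."""
    return Counter(text.lower().split())


def idf_weights(bags):
    """Inverse document frequency: log(N / number of documents containing the word).

    A word that appears in every document gets weight log(1) = 0.
    """
    n_docs = len(bags)
    # Iterating over a Counter yields each distinct word once per document.
    doc_freq = Counter(word for bag in bags for word in bag)
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


def cosine_similarity(a, b, weights=None):
    """Cosine of the angle between two word-count vectors.

    If `weights` is given, each count is multiplied by its word's weight first.
    """
    def w(word):
        return 1.0 if weights is None else weights.get(word, 0.0)

    dot = sum(a[t] * w(t) * b[t] * w(t) for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum((c * w(t)) ** 2 for t, c in a.items()))
    norm_b = math.sqrt(sum((c * w(t)) ** 2 for t, c in b.items()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a vector with no weighted content matches nothing
    return dot / (norm_a * norm_b)


# Toy corpus (hypothetical, for illustration only).
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "stocks fell on the news",
]
bags = [bag_of_words(d) for d in docs]
idf = idf_weights(bags)

plain_01 = cosine_similarity(bags[0], bags[1])
weighted_01 = cosine_similarity(bags[0], bags[1], idf)
weighted_02 = cosine_similarity(bags[0], bags[2], idf)
```

In this toy corpus the first and third documents share only "the" and "on", which occur in every document and so get zero IDF weight; their weighted similarity drops to zero, which is the sense in which IDF ignores non-discriminative features.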

Corrupting the Young; Enigmas of Chance

Posted at September 16, 2006 12:56

Three-Toed Sloth