(My notes for this lecture are too fragmentary to post. What follows is the sketch.)
The "raw data" is often not in the format most useful for the model one wants to work with. Lots of statistical computing work is about moving the information from one format to another --- about changing representations. Lossless transformations vs. lossy; why we often want lossy transformations. Re-organizing data to group it properly. (Example: going from multi-dimensional arrays to 2D data-frames and vice versa.) Aggregation as a change of representation. (Example: Going from dates of adoption for each doctor to cumulative proportion of adopters.)
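To make the aggregation example concrete, here is a minimal sketch in Python of going from one adoption date per doctor to the cumulative proportion of adopters. The function name `cumulative_adoption`, the use of integer month indices, and the data are all made up for illustration; they are not taken from the lecture or from any real data set.

```python
from collections import Counter

def cumulative_adoption(adoption_months, n_doctors):
    """Map each observed month to the proportion of doctors adopting by then.

    adoption_months: one (hypothetical) month index per doctor who adopted.
    n_doctors: total number of doctors, including those who never adopted.
    """
    counts = Counter(adoption_months)   # number of new adopters in each month
    running = 0
    curve = {}
    for month in sorted(counts):        # walk through months in time order
        running += counts[month]
        curve[month] = running / n_doctors
    return curve

# Made-up data: 10 doctors, 8 of whom adopt at some point.
months = [1, 3, 3, 4, 7, 7, 7, 9]
curve = cumulative_adoption(months, n_doctors=10)
print(curve)
```

This is a lossy transformation: the curve no longer says which doctor adopted when, only how many had adopted by each month. That loss is usually the point, since the curve is the object a diffusion model is fit to.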
Text processing via change of representation: the bag-of-words ("vector space") representation. Cosine and Jaccard similarities. Term frequency-inverse document frequency. Document clustering and classification.
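A minimal sketch of these ideas in Python, using only the standard library. The tokenizer (lower-casing and splitting on whitespace) is a deliberate simplification, the three documents are invented, and the TF-IDF weighting shown (raw count times log of the inverse document frequency) is just one common variant among several, not necessarily the one used in the lecture.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Word counts, ignoring order: the bag-of-words representation."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count (or word-weight) vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def jaccard(a, b):
    """Jaccard similarity between the sets of distinct words."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def tf_idf(bags):
    """Reweight counts by log(n_docs / n_docs_containing_word)."""
    n = len(bags)
    # Iterating over a bag yields each distinct word once per document.
    doc_freq = Counter(w for bag in bags for w in bag)
    return [{w: tf * math.log(n / doc_freq[w]) for w, tf in bag.items()}
            for bag in bags]

# Invented toy corpus.
docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "the cats and the dogs"]
bags = [bag_of_words(d) for d in docs]
weighted = tf_idf(bags)

print(cosine(bags[0], bags[1]), jaccard(bags[0], bags[1]))
print(cosine(weighted[0], weighted[1]))
```

Note that "the" occurs in every document, so its inverse document frequency is log(1) = 0 and it drops out of the weighted vectors entirely. That is why the TF-IDF cosine between the first two documents is lower than the raw-count cosine: much of their apparent similarity came from a word that carries no information about the topic.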
Readings: Spector, chapters 8 and 9.
On text: Lectures 1, 2 and 4 (+ slides) from data mining (vintage 2009).
Posted at November 18, 2013 10:30