
## Data Mining

22 Sep 2015 19:41

I've taught a course on this, so I ought to be able to describe it, oughtn't I? Data mining, more stuffily "knowledge discovery in databases", is the art of finding and extracting useful patterns in very large collections of data. It's not quite the same as machine learning, because, while it certainly uses ML techniques, the aim is to directly guide action (praxis!), rather than to develop a technology and theory of induction. In some ways, in fact, it's closer to what statistics calls "exploratory data analysis", though with certain advantages and limitations that come from having really big data to explore.

Kernel methods probably deserve their own notebook.

Recommended, close-ups:
Yong Wang, Ilze Ziedins, Mark Holmes, Neal Challands, "Tree Models for Difference and Change Detection in a Complex Environment", Annals of Applied Statistics 6 (2012): 1162--1184, arXiv:1202.1561 [In an ordinary classification tree, we are interested in the distribution of the class labels $Y$ given the predictors $X$, i.e., $\Pr(Y|X)$, and make splits on $X$ so that (in essence) the conditional entropy $H[Y|X]$ becomes small. Since $H[Y]$ is fixed by the data, this is of course equivalent to making splits so that the expected divergence of $\Pr(Y|X)$ from $\Pr(Y)$ is maximized. What they are interested in is not classification but describing how the different classes are distinct, so the relevant distribution is $\Pr(X|Y)$, and they want a big divergence between $\Pr(X)$ and $\Pr(X|Y)$.]
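
The equivalence claimed for ordinary classification trees can be checked numerically. What follows is only a minimal sketch, not anything taken from the paper: it assumes "divergence" means Kullback-Leibler divergence averaged over the leaves of a candidate split, and the function names (`entropy`, `kl`, `split_summary`) and the toy joint distribution are made up for illustration. With those assumptions, $H[Y] = H[Y|X] + \mathbf{E}_X[D(\Pr(Y|X) \| \Pr(Y))]$, so shrinking the first term on the right is the same as growing the second.

```python
from math import log2


def entropy(p):
    """Shannon entropy (in bits) of a discrete distribution given as a list."""
    return -sum(q * log2(q) for q in p if q > 0)


def kl(p, q):
    """Kullback-Leibler divergence D(p || q) in bits; assumes q > 0 wherever p > 0."""
    return sum(a * log2(a / b) for a, b in zip(p, q) if a > 0)


def split_summary(joint):
    """Summarize a candidate split (hypothetical helper, for illustration only).

    joint[x][y] is Pr(X = x, Y = y), where x indexes the leaves produced by
    the split and y indexes the class labels.
    Returns (H[Y], H[Y|X], expected KL of Pr(Y|X) from Pr(Y)).
    """
    p_x = [sum(row) for row in joint]          # marginal over leaves
    p_y = [sum(col) for col in zip(*joint)]    # marginal over classes
    cond_entropy = 0.0
    expected_kl = 0.0
    for row, w in zip(joint, p_x):
        if w == 0:
            continue  # empty leaf contributes nothing
        p_y_given_x = [v / w for v in row]
        cond_entropy += w * entropy(p_y_given_x)
        expected_kl += w * kl(p_y_given_x, p_y)
    return entropy(p_y), cond_entropy, expected_kl


# Toy split: two leaves, two classes, each leaf dominated by one class.
h_y, h_y_given_x, expected_kl = split_summary([[0.4, 0.1], [0.1, 0.4]])
# The two quantities on the right add up to H[Y] (up to rounding error).
print(h_y, h_y_given_x + expected_kl)
```

A split that tells us nothing about the class, e.g. `[[0.25, 0.25], [0.25, 0.25]]`, gives zero expected divergence and leaves $H[Y|X] = H[Y]$.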
