Notebooks

## Data Mining

17 May 2021 10:41

I've taught a course on this, so I ought to be able to describe it, oughtn't I? Data mining, more stuffily "knowledge discovery in databases", is the art of finding and extracting useful patterns in very large collections of data. It's not quite the same as machine learning, because, while it certainly uses ML techniques, the aim is to directly guide action (praxis!), rather than to develop a technology and theory of induction. In some ways, in fact, it's closer to what statistics calls "exploratory data analysis", though with certain advantages and limitations that come from having really big data to explore.

Kernel methods get their own notebook.

Ethical and political issues in data mining definitely deserve their own notebook.

Recommended, big picture:
• Leo Breiman, "Statistical Modeling: The Two Cultures", Statistical Science 16 (2001): 199--231 [very much including the discussion by others and the reply by Breiman]
• Pedro Domingos, "A Few Useful Things to Know about Machine Learning"
• David Hand, Heikki Mannila and Padhraic Smyth, Principles of Data Mining [The textbook I teach from; also a book I learned a lot from.]
• Trevor Hastie, Robert Tibshirani and Jerome Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction [Website, with full text free in PDF]
• Cathy O'Neil, Weapons of Math Destruction
• Sholom M. Weiss and Nitin Indurkhya, Predictive Data Mining: A Practical Guide [Pedestrian, but it is practical, and adapted to the meanest, i.e. the managerial, understanding]
Recommended, close-ups:
• Alexander D'Amour, Katherine Heller, Dan Moldovan, Ben Adlam, Babak Alipanahi, Alex Beutel, Christina Chen, Jonathan Deaton, Jacob Eisenstein, Matthew D. Hoffman, Farhad Hormozdiari, Neil Houlsby, Shaobo Hou, Ghassen Jerfel, Alan Karthikesalingam, Mario Lucic, Yian Ma, Cory McLean, Diana Mincu, Akinori Mitani, Andrea Montanari, Zachary Nado, Vivek Natarajan, Christopher Nielson, Thomas F. Osborne, Rajiv Raman, Kim Ramasamy, Rory Sayres, Jessica Schrouff, Martin Seneviratne, Shannon Sequeira, Harini Suresh, Victor Veitch, Max Vladymyrov, Xuezhi Wang, Kellie Webster, Steve Yadlowsky, Taedong Yun, Xiaohua Zhai, D. Sculley, "Underspecification Presents Challenges for Credibility in Modern Machine Learning", arxiv:2011.03395
• Jesse Davis and Mark Goadrich, "The Relationship Between Precision-Recall and ROC Curves" [PDF preprint]
• Sharad Goel, Jake M. Hofman, Sébastien Lahaie, David M. Pennock, and Duncan J. Watts, "Predicting consumer behavior with Web search", Proceedings of the National Academy of Sciences (USA) 107 (2010): 17486--17490 [A case study in using data mining, while recognizing limitations]
• Aleks Jakulin and Ivan Bratko, "Quantifying and Visualizing Attribute Interactions", cs.AI/0308002
• Jacob Kogan, Introduction to Clustering Large and High-Dimensional Data
• Jon Kleinberg, Christos Papadimitriou and Prabhakar Raghavan, "A Microeconomic View of Data Mining", Data Mining and Knowledge Discovery 2 (1998) [PDF]
• Ariel Kleiner, Ameet Talwalkar, Purnamrita Sarkar, Michael I. Jordan, "A Scalable Bootstrap for Massive Data", arxiv:1112.5016
• Kling, Scherson and Allen, "Parallel Computing and Information Capitalism," in Metropolis and Rota (eds.), A New Era in Computation (1992) [A batch of UC Irvine comp. sci. professors who write like sociologists. " `Information capitalism' refers to forms of organization in which data-intensive techniques and computerization are key strategic resources for corporate production."]
• Jure Leskovec, Anand Rajaraman and Jeffrey David Ullman, Mining of Massive Datasets
• R. Dean Malmgren, Jake M. Hofman, Luis A. N. Amaral, Duncan J. Watts, "Characterizing Individual Communication Patterns", arxiv:0905.0106
• John Shawe-Taylor and Nello Cristianini, Kernel Methods for Pattern Analysis
• Ryan J. Tibshirani, "Degrees of Freedom and Model Search", arxiv:1402.1920
• Yong Wang, Ilze Ziedins, Mark Holmes, Neal Challands, "Tree Models for Difference and Change Detection in a Complex Environment", Annals of Applied Statistics 6 (2012): 1162--1184, arxiv:1202.1561 [In an ordinary classification tree, we are interested in the distribution of the class labels $Y$ given the predictors $X$, i.e., $\Pr(Y|X)$, and make splits on $X$ so that (in essence) the conditional entropy $H[Y|X]$ becomes small. This is of course equivalent to making splits so that the divergence of $\Pr(Y|X)$ from $\Pr(Y)$ is maximized. What they are interested in is not classification but describing how the different classes are distinct, so the relevant distribution is $\Pr(X|Y)$, and they want a big divergence between $\Pr(X)$ and $\Pr(X|Y)$.]
• Jianming Ye, "On Measuring and Correcting the Effects of Data Mining and Model Selection", Journal of the American Statistical Association 93 (1998): 120--131
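The equivalence invoked in the Wang et al. annotation above — splitting to shrink the conditional entropy $H[Y|X]$ is the same as splitting to maximize the divergence of $\Pr(Y|X)$ from $\Pr(Y)$ — is just the two standard expressions for mutual information, $I[X;Y] = H[Y] - H[Y|X] = \mathbb{E}_X\left[D\!\left(\Pr(Y|X) \| \Pr(Y)\right)\right]$. A minimal numerical check (the joint distribution here is made up purely for illustration):

```python
import numpy as np

# Hypothetical joint distribution Pr(X, Y): rows index x values, columns y labels.
joint = np.array([[0.30, 0.05],
                  [0.10, 0.25],
                  [0.05, 0.25]])

px = joint.sum(axis=1)             # marginal Pr(X)
py = joint.sum(axis=0)             # marginal Pr(Y)
p_y_given_x = joint / px[:, None]  # conditional Pr(Y|X), one row per x

def entropy(p):
    """Shannon entropy in bits, ignoring zero-probability cells."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Route 1: mutual information as entropy reduction, H[Y] - H[Y|X].
h_y = entropy(py)
h_y_given_x = np.sum(px * np.array([entropy(row) for row in p_y_given_x]))
mi_entropy = h_y - h_y_given_x

# Route 2: expected KL divergence of Pr(Y|X) from Pr(Y), averaged over X.
kl_per_x = np.sum(p_y_given_x * np.log2(p_y_given_x / py), axis=1)
mi_divergence = np.sum(px * kl_per_x)

print(mi_entropy, mi_divergence)  # the two routes agree
```

So a split criterion that minimizes the post-split $H[Y|X]$ and one that maximizes the average divergence from the marginal class distribution are the same criterion, just written from opposite ends of the identity.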
Modesty forbids me to recommend:
• My lecture notes for my data mining class [I learned a lot, in writing these, from notes from the previous version of the course written by Tom Minka, and modesty does not forbid me from recommending his work.]