Notebooks

## Data Mining

17 May 2021 10:41

I've taught a course on this, so I ought to be able to describe it, oughtn't I? Data mining, more stuffily "knowledge discovery in databases", is the art of finding and extracting useful patterns in very large collections of data. It's not quite the same as machine learning, because, while it certainly uses ML techniques, the aim is to directly guide action (praxis!), rather than to develop a technology and theory of induction. In some ways, in fact, it's closer to what statistics calls "exploratory data analysis", though with certain advantages and limitations that come from having really big data to explore.

Kernel methods get their own notebook.

Ethical and political issues in data mining definitely deserve their own notebook.

Recommended, big picture:
• Leo Breiman, "Statistical Modeling: The Two Cultures", Statistical Science 16 (2001): 199--231 [very much including the discussion by others and the reply by Breiman]
• Pedro Domingos, "A Few Useful Things to Know about Machine Learning"
• David Hand, Heikki Mannila and Padhraic Smyth, Principles of Data Mining [The textbook I teach from; also a book I learned a lot from.]
• Trevor Hastie, Robert Tibshirani and Jerome Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction [Website, with full text free in PDF]
• Cathy O'Neil, Weapons of Math Destruction
• Sholom M. Weiss and Nitin Indurkhya, Predictive Data Mining: A Practical Guide [Pedestrian, but it is practical, and adapted to the meanest, i.e. the managerial, understanding]
Recommended, close-ups:
• Alexander D'Amour, Katherine Heller, Dan Moldovan, Ben Adlam, Babak Alipanahi, Alex Beutel, Christina Chen, Jonathan Deaton, Jacob Eisenstein, Matthew D. Hoffman, Farhad Hormozdiari, Neil Houlsby, Shaobo Hou, Ghassen Jerfel, Alan Karthikesalingam, Mario Lucic, Yian Ma, Cory McLean, Diana Mincu, Akinori Mitani, Andrea Montanari, Zachary Nado, Vivek Natarajan, Christopher Nielson, Thomas F. Osborne, Rajiv Raman, Kim Ramasamy, Rory Sayres, Jessica Schrouff, Martin Seneviratne, Shannon Sequeira, Harini Suresh, Victor Veitch, Max Vladymyrov, Xuezhi Wang, Kellie Webster, Steve Yadlowsky, Taedong Yun, Xiaohua Zhai, D. Sculley, "Underspecification Presents Challenges for Credibility in Modern Machine Learning", arxiv:2011.03395
• Jesse Davis and Mark Goadrich, "The Relationship Between Precision-Recall and ROC Curves" [PDF preprint]
• Sharad Goel, Jake M. Hofman, Sébastien Lahaie, David M. Pennock, and Duncan J. Watts, "Predicting consumer behavior with Web search", Proceedings of the National Academy of Sciences (USA) 107 (2010): 17486--17490 [A case study in using data mining, while recognizing limitations]
• Aleks Jakulin and Ivan Bratko, "Quantifying and Visualizing Attribute Interactions", cs.AI/0308002
• Jacob Kogan, Introduction to Clustering Large and High-Dimensional Data
• Jon Kleinberg, Christos Papadimitriou and Prabhakar Raghavan, "A Microeconomic View of Data Mining", Data Mining and Knowledge Discovery 2 (1998) [PDF]
• Ariel Kleiner, Ameet Talwalkar, Purnamrita Sarkar, Michael I. Jordan, "A Scalable Bootstrap for Massive Data", arxiv:1112.5016
• Kling, Scherson and Allen, "Parallel Computing and Information Capitalism," in Metropolis and Rota (eds.), A New Era in Computation (1992) [A batch of UC Irvine comp. sci. professors who write like sociologists. " `Information capitalism' refers to forms of organization in which data-intensive techniques and computerization are key strategic resources for corporate production."]
• Jure Leskovec, Anand Rajaraman and Jeffrey David Ullman, Mining of Massive Datasets
• R. Dean Malmgren, Jake M. Hofman, Luis A. N. Amaral, Duncan J. Watts, "Characterizing Individual Communication Patterns", arxiv:0905.0106
• John Shawe-Taylor and Nello Cristianini, Kernel Methods for Pattern Analysis
• Ryan J. Tibshirani, "Degrees of Freedom and Model Search", arxiv:1402.1920
• Yong Wang, Ilze Ziedins, Mark Holmes, Neal Challands, "Tree Models for Difference and Change Detection in a Complex Environment", Annals of Applied Statistics 6 (2012): 1162--1184, arxiv:1202.1561 [In an ordinary classification tree, we are interested in the distribution of the class labels $Y$ given the predictors $X$, i.e., $\Pr(Y|X)$, and make splits on $X$ so that (in essence) the conditional entropy $H[Y|X]$ becomes small. This is of course equivalent to making splits so that the divergence of $\Pr(Y|X)$ from $\Pr(Y)$ is maximized. What they are interested in is not classification but describing how the different classes are distinct, so the relevant distribution is $\Pr(X|Y)$, and they want a big divergence between $\Pr(X)$ and $\Pr(X|Y)$.]
• Jianming Ye, "On Measuring and Correcting the Effects of Data Mining and Model Selection", Journal of the American Statistical Association 93 (1998): 120--131
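The equivalence invoked in the Wang et al. annotation above — splitting to shrink the conditional entropy $H[Y|X]$ is the same as splitting to maximize the divergence of $\Pr(Y|X)$ from $\Pr(Y)$ — is just the two standard expressions for mutual information, $I[X;Y] = H[Y] - H[Y|X] = \mathbb{E}_X\left[D\!\left(\Pr(Y|X) \| \Pr(Y)\right)\right]$. A minimal numerical check (the joint distribution here is made up purely for illustration):

```python
import numpy as np

# Hypothetical joint distribution Pr(X, Y): rows index x values, columns y labels.
joint = np.array([[0.30, 0.05],
                  [0.10, 0.25],
                  [0.05, 0.25]])

px = joint.sum(axis=1)             # marginal Pr(X)
py = joint.sum(axis=0)             # marginal Pr(Y)
p_y_given_x = joint / px[:, None]  # conditional Pr(Y|X), one row per x

def entropy(p):
    """Shannon entropy in bits, ignoring zero-probability cells."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Route 1: mutual information as entropy reduction, H[Y] - H[Y|X].
h_y = entropy(py)
h_y_given_x = np.sum(px * np.array([entropy(row) for row in p_y_given_x]))
mi_entropy = h_y - h_y_given_x

# Route 2: expected KL divergence of Pr(Y|X) from Pr(Y), averaged over X.
kl_per_x = np.sum(p_y_given_x * np.log2(p_y_given_x / py), axis=1)
mi_divergence = np.sum(px * kl_per_x)

print(mi_entropy, mi_divergence)  # the two routes agree
```

So a split criterion that minimizes the post-split $H[Y|X]$ and one that maximizes the average divergence from the marginal class distribution are the same criterion, just written from opposite ends of the identity.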
Modesty forbids me to recommend:
• My lecture notes for my data mining class [I learned a lot, in writing these, from notes from the previous version of the course written by Tom Minka, and modesty does not forbid me from recommending his work.]