Machine Learning, Statistical Inference and Induction

Last update: 21 Apr 2025 21:17
First version:

There's a place where AI, statistics and epistemology-methodology converge, or want to anyhow. "Machine learning" is the AI label: how do we make a machine that can find and learn the regularities in a data set? (If the data set is really, really big, and we care mostly about making practically valuable predictions, this becomes data mining, or "knowledge discovery in databases," KDD.) The statisticians ask very similar questions about model-fitting and hypothesis-testing. The epistemologists are mired in the problem of induction, and "inference to the best explanation" (a phrase, I am told by Kenny Easwaran, coined by Gilbert Harman; link below). The fields over-lap in the most crazy-quilt and arbitrary way: I've heard university librarians arguing over whether specific books should go to the engineering or the philosophy library, for instance.

The connection to neuroscience and cognitive science is plain: how on Earth do human beings, and other critters, actually learn? Given that there are many different strategies, which ones do organisms use, and why, and are they good ones? (It's entirely possible that we've gotten locked in to inefficient learning strategies; then the question becomes whether or not they can be improved.) Studying learning by organisms lets us test theories of learning-in-the-abstract, and vice versa: if we had, say, a good proof that a certain learning scheme simply would not work, we'd know that animals don't use it.

One fairly strong result seems to be that tabulae rasae don't work: you've got to give the machine/baby/scientist some hints, or restrict the field of possible hypotheses initially, or you'll never get anywhere. This was at least implicit in Hume, and I believe the other classical empiricists as well, but they don't seem to have been restrictive enough to account for the way we actually do learn. Natural selection is the obvious candidate for having restricted our hypothesis-set, and for having designed our learning mechanisms.
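
To make the point about restricted hypothesis sets concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption, not something taken from the works listed below: three binary features, a hypothetical target rule (the label is the first feature), six training inputs and two held-out ones. With every Boolean function allowed, the hypotheses consistent with the data split evenly on each unseen input, so the data say nothing about it. Restricting the field to "the label is one feature, or its negation" leaves a single survivor.

```python
from itertools import product

# All 8 possible inputs over three binary features.
INPUTS = list(product([0, 1], repeat=3))


def target(x):
    # Hypothetical "true" regularity, chosen only for illustration.
    return x[0]


# Hold out two inputs; the learner sees labels for the other six.
held_out = [(0, 1, 1), (1, 1, 1)]
train_data = [(x, target(x)) for x in INPUTS if x not in held_out]


def consistent(h, data):
    # A hypothesis survives if it reproduces every training label.
    return all(h[x] == y for x, y in data)


def all_boolean_functions():
    # The "blank slate": every one of the 2**8 truth tables is allowed.
    for labels in product([0, 1], repeat=len(INPUTS)):
        yield dict(zip(INPUTS, labels))


def single_feature_hypotheses():
    # A restricted field: the label is one feature or its negation.
    for i in range(3):
        for negate in (False, True):
            yield {x: (1 - x[i]) if negate else x[i] for x in INPUTS}


unrestricted = [h for h in all_boolean_functions() if consistent(h, train_data)]
restricted = [h for h in single_feature_hypotheses() if consistent(h, train_data)]

# How many surviving unrestricted hypotheses predict 1 on each unseen input?
votes_for_one = {x: sum(h[x] for h in unrestricted) for x in held_out}

print(len(unrestricted), votes_for_one)
print(len(restricted), [restricted[0][x] for x in held_out])
```

The unrestricted learner keeps as many hypotheses predicting 0 as predicting 1 on every held-out input, however the training labels come out. The restricted learner commits to a prediction, and is right here only because the assumed target happens to lie inside its hypothesis set.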

My positivist temperament can hardly help being pleased by this "attempt to introduce the experimental method of reasoning into moral subjects," which, as data mining, has massive industrial applications. My real interest in this isn't, for once, philosophical. Instead, I want to be able to quantify, or at the very least characterize, self-organization, which means I need a good way of automatically finding patterns or regularities in data-sets. For someone who's got the computational mechanics gospel, this means "inferring statistical complexity," and that means the automated construction of abstract-machine or formal-language models of data-sets. (Alternately: Figuring out how natural things compute.) And doing that well means addressing all the issues people in these areas address, so I figure I ought to just steal from them.

Leo Breiman, "Statistical Modeling: The Two Cultures", Statistical Science 16 (2001): 199--231 [Very much including the discussion by others and the reply by Breiman. Thanks to Chris Wiggins for alerting me to this.]
Nicolo Cesa-Bianchi and Gabor Lugosi, Prediction, Learning, and Games [Mini-review]
Ulf Grenander, Elements of Pattern Theory
David Hand, Heikki Mannila and Padhraic Smyth, Principles of Data Mining
Trevor Hastie, Robert Tibshirani and Jerome Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction [Website, with full text free in PDF]
John H. Holland, Keith J. Holyoak, Richard E. Nisbett, and Paul R. Thagard, Induction: Processes of Inference, Learning and Discovery [Review: The Best-Laid Schemes o' Mice an' Men]
Michael J. Kearns and Umesh V. Vazirani, An Introduction to Computational Learning Theory [Review: How to Build a Better Guesser]
Deborah G. Mayo, Error and the Growth of Experimental Knowledge [How to use standard statistical tests to learn from experiment, without Bayesian priors or other a priori folderol. Review: We Have Ways of Making You Talk, or, Long Live Peircism-Popperism-Neyman-Pearson Thought!]
Deborah G. Mayo and D. R. Cox, "Frequentist statistics as a theory of inductive inference", math.ST/0610846
John Norton, "A Material Theory of Induction", Philosophy of Science 70 (2003): 647--670 [PDF reprint]
Jorma Rissanen, Stochastic Complexity in Statistical Inquiry [Review: Less Is More, or, Ecce data!]
Sara J. Shettleworth, Cognition, Evolution and Behavior
Peter Spirtes, Clark Glymour and Richard Scheines, Causation, Prediction, and Search
Chris Thornton, Truth from Trash: How Learning Makes Sense [Well, half a recommendation. Review: Two Cheers for Trash]
V. N. (=Vladimir Naumovich) Vapnik, The Nature of Statistical Learning Theory [Review: A Useful Biased Estimator]
H. Peyton Young, Individual Strategy and Social Structure [Pretty dumb agents nonetheless able to learn in a basic sense, and what they can accomplish in the way of societies. Review: A Myopic (and Sometimes Blind) Eye on the Main Chance, or, the Origins of Custom]

Shun-ichi Amari, "Information Geometry on Hierarchical Decomposition of Stochastic Interactions," IEEE Transactions on Information Theory 47 (2001): 1701--1711 [A way of finding "parts" in complex distributions; uses many differential geometry tricks to do statistics. PDF reprint]
Massimiliano Badino, "An Application of Information Theory to the Problem of the Scientific Experiment", Synthese 140 (2004): 355--389 [MS Word preprint. See comments under Information Theory.]
David Balduzzi
- "Information, learning and falsification", arxiv:1110.3592
- "Falsification and future performance", arxiv:1111.5648
H. B. Barlow, "Unsupervised Learning", Neural Computation 1 (1989): 295--311
Jonathan Baxter, "A Model of Inductive Bias Learning," Journal of Artificial Intelligence Research 12 (2000): 149--198 [How to learn what class of hypotheses you should be trying to use, i.e., your inductive bias. Assumes independence, again.]
Mikhail Belkin, Partha Niyogi, Vikas Sindhwani, "Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples", Journal of Machine Learning Research 7 (2006): 2399--2434
William Bialek, Ilya Nemenman, and Naftali Tishby, "Predictability, Complexity and Learning," physics/0007070
Ken Binmore, "Making Decisions in Large Worlds" ["This paper argues that we need to look beyond Bayesian decision theory for an answer to the general problem of making rational decisions under uncertainty." PDF manuscript; thanks to Nicolas Della Penna for the pointer]
Margaret Boden, The Creative Mind: Myths and Mechanisms [How and when to change the kind of representation you're using, a topic shamefully neglected in the literature. Precis]
Josh Bongard and Hod Lipson, "Automated reverse engineering of nonlinear dynamical systems", Proceedings of the National Academy of Sciences (USA) 104 (2007): 9943--9948 [Thanks to Chris Weed for pointing me to this. Interesting, but basically unaware of the literature on state-space reconstruction in nonlinear dynamics.]
R. B. Braithwaite, Scientific Explanation
Arthur W. Burks, "Peirce's Theory of Abduction", Philosophy of Science 13 (1946): 301--306 [JSTOR; ungated copy]
Venkat Chandrasekaran and Michael I. Jordan, "Computational and Statistical Tradeoffs via Convex Relaxation", Proceedings of the National Academy of Sciences (USA) 110 (2013): E1181--E1190, arxiv:1211.1073
Pedro Domingos
- "The Role of Occam's Razor in Knowledge Discovery," Data Mining and Knowledge Discovery, 3 (1999) [Online]
- "A Few Useful Things to Know about Machine Learning" [PDF preprint]
Marco Dorigo and Marco Colombetti, Robot Shaping: An Experiment in Behavior Engineering [Review: Crawling Towards the Light]
John W. Fisher III, Alexander T. Ihler and Paul A. Viola, "Learning Informative Statistics: A Nonparametric Approach", pp. 900--906 in NIPS 12 (1999) [PDF reprint. I'd call this more of a semi-parametric approach than a fully non-parametric one; they assume a parametric form for the dependence structure, but are agnostic about the distributions of innovations, and so try to maximize non-parametrically estimated mutual informations.]
Francois Fleuret and Donald Geman, "Stationary Features and Cat Detection", Journal of Machine Learning Research 9 (2008): 2549--2578
Peter Godfrey-Smith, "Inductions, Samples, and Kinds" [PDF preprint]
David J. Hand, "Classifier Technology and the Illusion of Progress", Statistical Science 21 (2006): 1--15, math.ST/0606441 [Or: don't believe everything you read in ICML! With commentary, available from the arxiv.org link]
Hinton and Sejnowski (eds.), Unsupervised Learning [A sort of "Neural Computation's Greatest Hits" compilation]
Hrayr Harutyunyan, Maxim Raginsky, Greg Ver Steeg, Aram Galstyan, "Information-theoretic generalization bounds for black-box learning algorithms", forthcoming in NeurIPS 2021, arxiv:2110.01584
Tommi S. Jaakkola and David Haussler, "Exploiting generative models in discriminative classifiers", NIPS 11 (1998) [PDF]
Aleks Jakulin and Ivan Bratko, "Quantifying and Visualizing Attribute Interactions", cs.AI/0308002
Kevin T. Kelly [Kelly's work on Occam's Razor is, so far as I know, the only justification for it which doesn't either massively beg the question, change the subject, or make massive assumptions about the nature of the world, Divine Providence, etc.]
- "A New Solution to the Puzzle of Simplicity", phil-sci/2984 [One of his clearest papers]
- Ockham Project Web Page
Shane Legg, "Is There an Elegant Universal Theory of Prediction?", cs.AI/0606070 [A nice set of diagonalization arguments against the hope of a universal prediction scheme which has the nice features of Solomonoff-style induction, but is actually computable.]
Jerzy Neyman, First Course in Probability and Statistics [Fine explanation of his ideas about "rules of inductive behavior" --- which probably isn't very good methodology, but has the makings of excellent robotics]
Leonid Peshkin, "Structure induction by lossless graph compression", cs.DS/0703132 [Adapting data-compression ideas to discover hierarchical structures in graphs, e.g., the 4 bases from a tinker-toy model of DNA.]
Ali Rahimi and Benjamin Recht, "Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning", NIPS 2008
Rajat Raina, Alexis Battle, Honglak Lee, Benjamin Packer and Andrew Y. Ng, "Self-taught learning: Transfer learning from unlabeled data", ICML 2007 [PDF. This is a clever idea for semi-supervised learning. Given a big supply of unlabeled examples, and a small number of labeled examples, use the unlabeled ones to learn a high-level/abstract representation or set of features. Then use those features in straightforward classifier learning on the labeled data. (They have a specific idea for learning the higher-level representation, by basis selection, but that's a separable issue.)]
Robert E. Schapire and Yoav Freund, Boosting: Foundations and Algorithms [Review: Weak Learners of the World, Unite!]
Gerhard Schurz, "Universal vs. Local Prediction Strategies: A Game-Theoretical Approach to the Problem of Induction", phil-sci/3720 [Slides only?!?]
Spyros Skouras, "Decisionmetrics: Towards a Decision-Based Approach to Econometrics" [Suppose what you really want to do with your model is to make decisions, e.g., to buy and sell and make money doing so. Then fitting the model to minimize a standard error measure, e.g., mean square error, often gives worse performance than fitting the model to minimize expected losses. This applies much more broadly than Spyros's financial examples may suggest.]
Aris Spanos, "The Curve-Fitting Problem, Akaike-type Model Selection, and the Error Statistical Approach" [PDF preprint]
Sara van de Geer, Applications of Empirical Process Theory [A.k.a. Empirical Process Theory in M-Estimation]
Greg Ver Steeg, Aram Galstyan, "Discovering Structure in High-Dimensional Data Through Correlation Explanation", arxiv:1406.1222
Vladimir Vovk, Alex Gammerman and Glenn Shafer, Algorithmic Learning in a Random World [Mini-review]
Blaz Zupan, Marko Bohanec, Janez Demsar and Ivan Bratko, "Learning by discovering concept hierarchies", Artificial Intelligence 109 (1999): 211--242 [Thanks to Aleks Jakulin for letting me know about this. PDF preprint]
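
The point in the Skouras entry above, that fitting to a standard error measure can do worse than fitting to the loss you actually face, shows up even in the simplest possible case. The sketch below is not Skouras's method, and every specific in it is an assumption made for illustration: the "model" is a single constant prediction, the data are invented, and the hypothetical decision loss charges 4 per unit of under-prediction and 1 per unit of over-prediction.

```python
# Invented outcomes, for illustration only.
outcomes = [1, 2, 2, 3, 3, 3, 4, 5, 8, 9]

UNDER_COST = 4  # assumed cost per unit when the prediction is too low
OVER_COST = 1   # assumed cost per unit when the prediction is too high


def decision_loss(c):
    # Average asymmetric loss from always predicting the constant c.
    total = 0
    for y in outcomes:
        if y > c:
            total += UNDER_COST * (y - c)
        else:
            total += OVER_COST * (c - y)
    return total / len(outcomes)


# Fit 1: minimize mean squared error. For a constant, that is the sample mean.
mse_fit = sum(outcomes) / len(outcomes)

# Fit 2: minimize the decision loss directly. The loss is piecewise linear
# and convex in c, so it is enough to search over the observed values.
decision_fit = min(sorted(set(outcomes)), key=decision_loss)

print(mse_fit, decision_loss(mse_fit))
print(decision_fit, decision_loss(decision_fit))
```

On these numbers the squared-error fit predicts the mean and pays a higher average decision loss than the fit tuned to the asymmetric costs, which deliberately predicts high because under-prediction is the expensive mistake.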

Dana Ballard, An Introduction to Natural Computation [Review: Not Natural Enough]
Jacob Feldman, "How surprising is a simple pattern? Quantifying 'Eureka!'," Cognition 93 (2004): 199--224 [Claims to (a) have a psychologically valid measure of subjective complexity, and (b) derive a null distribution for it. But the evidence that his particular complexity measure captures what people do in concept-learning problems is deferred to other papers.]
Gilbert Harman and Sanjeev Kulkarni, Reliable Reasoning: Induction and Statistical Learning Theory [Published by MIT Press; 2006 draft free online via Prof. Kulkarni (about 100 pages). The technical material on learning theory is mostly alright, so far as it goes, but the philosophy is irritatingly lack-luster. Definitely not worth paying what the publisher charges for it. — There is now a good review by Kevin Kelly and Conor Mayo-Wilson.]

CRS, Causal Architecture, Complexity and Self-Organization in Time Series and Cellular Automata [Ph.D. thesis, UW-Madison, 2001]
CRS, "Dynamics of Bayesian Updating with Dependent Data and Mis-specified Models", arxiv:0901.1342 = Electronic Journal of Statistics 3 (2009): 1039--1074
CRS and Kristina Lisa Klinkner, "Blind Construction of Optimal Nonlinear Recursive Predictors for Discrete Sequences", pp. 504--511 in UAI 2004, cs.LG/0406011

Steven Abney, "Bootstrapping", ACL 2002, pp. 360--367 [In the sense of "a problem setting in which one is given a small set of labeled data and a large set of unlabeled data, and the task is to induce a classifier", not the famous statistical procedure]
Tatsuya Akutsu, Satoru Miyano and Satoru Kuhara, "A simple greedy algorithm for finding functional relations: efficient implementation and average case analysis," Theoretical Computer Science 292 (2002): 481--495
Atocha Aliseda, Abductive Reasoning: Logical Investigations into Discovery and Explanation
Andris Ambainis, "Probabilistic inductive inference: a survey", cs.LG/9902026 [Taking "inductive inference" exclusively in the sense of learning recursive functions]
Rosa I. Arriaga and Santosh Vempala, "An algorithmic theory of learning: Robust concepts and random projection", Machine Learning 63 (2006): 161--182
Nihat Ay
- "Locality of global stochastic interaction in directed acyclic networks," preprint, MPI-MIS 54/2001
- "An information geometric approach to a theory of pragmatic structuring," MPI-MIS 52/2000
Vijay Balasubramanian, "Statistical Inference, Occam's Razor, and Statistical Mechanics on the Space of Probability Distributions", Neural Computation 9 (1997): 349--368, arxiv:cond-mat/9601030
Pierre Baldi et al., Modeling the Internet and the Web: Probabilistic Methods and Algorithms
William Bechtel and Robert C. Richardson, Discovering Complexity: Decomposition and Localization as Strategies in Scientific Research
Sergey V. Beiden, Marcus A. Maloof and Robert F. Wagner, "A General Model for Finite-Sample Effects in Training and Testing of Competing Classifiers", IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (2003): 1561--1569
Ron Bekkerman, Mikhail Bilenko and John Langford (eds.), Scaling up Machine Learning: Parallel and Distributed Approaches
D. Paul Benjamin (ed.), Change of Representation and Inductive Bias
James Blachowicz, Of Two Minds: The Nature of Inquiry [From the back cover: "The logic of correction developed here directly opposes the claim made by evolutionary epistemologists such as Popper and Campbell that there is no such thing as a 'logical method for having new ideas.' ... This comprehensive and revolutionary theory challenges traditional epistemology's conception of justification and provides substantial new interpretations of the nature of ampliative inference, representation and meaning, Platonic and Hegelian dialectic, Kantian analysis, the heuristic function of models and metaphors, and the role of inquiry in the constitution of human consciousness." All this in only four hundred pages! But the stuff on a logic of correction is very important --- if correct.]
Gilles Blanchard and Donald Geman, "Hierarchical testing designs for pattern recognition", math.ST/0507421 = Annals of Statistics 33 (2005): 1155--1202
Avrim Blum and Tom Mitchell, "Combining Labeled and Unlabeled Data with Co-Training", COLT 98, pp. 92--100
Avrim Blum, Adam Kalai and Hal Wasserman, "Noise-Tolerant Learning, the Parity Problem, and the Statistical Query Model," cs.LG/0010022
Leo Breiman, "Prediction Games and Arcing Algorithms," Neural Computation 11 (1999): 1493--1517
Robert Alan Brown, Machines that Learn: Based on the Principle of Empirical Control
Christopher J. C. Burges, "Dimension Reduction: A Guided Tour", Foundations and Trends in Machine Learning 2:4 (2010) [Preprint version]
Meir Buzaglo, The Logic of Concept Expansion
Adam Cannon, J. Mark Ettinger, Don Hush, and Clint Scovel, "Machine Learning with Data Dependent Hypothesis Classes," Journal of Machine Learning Research 2 (2002): 335--358
Philip Ellery Catton, "The Justification(s) of Induction(s)," online
Tommy W. S. Chow and D. Huang, "Estimating Optimal Feature Subsets Using Efficient Estimation of High-Dimensional Mutual Information", IEEE Transactions on Neural Networks 16 (2005): 213--224
Andy Clark and Chris Thornton, "Trading Spaces: Computation, Representation and the Limits of Uninformed Learning," Behavioral and Brain Sciences 20 (1997): 57--90 [Draft]
Bertrand Clarke, "Desiderata for a Predictive Theory of Statistics", Bayesian Analysis 5 (2010): 1--36
David Corfield, "Varieties of Justification in Machine Learning", Minds and Machines 20 (2010): 291--301
Toby S. Cubitt, Jens Eisert, Michael M. Wolf, "Extracting dynamical equations from experimental data is NP-hard", Physical Review Letters 108 (2012): 120503, arxiv:1005.0005
Mark Culp, George Michailidis and Kjell Johnson, "On multi-view learning with additive models", Annals of Applied Statistics 3 (2009): 292--318 = arxiv:0906.1117
H. Daume III, D. Marcu, "Domain Adaptation for Statistical Classifiers", arxiv:1109.6341
Peter Dayan, "Recurrent Sampling Models for the Helmholtz Machine," Neural Computation 11 (1999): 653--677
Carlos R. de la Mora B., Carlos Gershenson and Angelica Garcia-Vega, "The role of behavior modifiers in representation development", cs.AI/0403006
Luc Devroye et al., A Probabilistic Theory of Pattern Recognition
Thomas G. Dietterich, "Machine Learning for Sequential Data" [PDF. Thanks to Gustavo Lacerda for a pointer.]
Nicola Di Mauro, Teresa M.A. Basile, Stefano Ferilli, Floriana Esposito, "Feature Construction for Relational Sequence Learning", arxiv:1006.5188
Pedro Domingos [All from his web-site]
- A General Method for Scaling Up Machine Learning Algorithms and its Application to Clustering
- Mining High-Speed Data Streams
- Mining Time-Changing Data Streams
Dowe, Korb and Oliver (eds.), Information, Statistics and Induction in Science
Deniz Erdogmus, Kenneth E. Hild, II, Yadunandana N. Rao and José C. Príncipe, "Minimax Mutual Information Approach for Independent Component Analysis", Neural Computation 16 (2004): 1235--1252
Oleg V. Favorov and Dan Ryder, "SINBAD: A neocortical mechanism for discovering environmental variables and regularities hidden in sensory input", Biological Cybernetics 90 (2004): 191--202
Aidan Feeney and Evan Heit (eds.), Inductive Reasoning: Experimental, Developmental, and Computational Approaches
David Finton, "When Do Differences Matter? On-Line Feature Extraction Through Cognitive Economy", cs.LG/0404032 = Cognitive Systems Research 6 (2005): 263--281
Gary William Flake, "The Calculus of Jacobian Adaptation"
Francois Fleuret and Eric Brunet, "DEA: An Architecture for Goal Planning and Classification," Neural Computation 12 (2000): 1987--2008
Flocchini et al. (eds.), Structure, Information and Communication Complexity
Malcolm R. Forster, "How do Simple Rules 'Fit to Reality' in a Complex World?", Minds and Machines 9 (1999): 543--564 [A take on the Gigerenzer et al. idea of fast and frugal heuristics, especially their ecological adaptation to the environment. "The main purpose of this article is to apply these ideas to learning rules --- methods for constructing, selecting or evaluating competing hypotheses in science, and to the methodology of machine learning... The bad news is that ecological validity is particularly difficult to implement and difficult to understand. The good news is that it builds an important bridge from normative psychology and machine learning to recent work in the philosophy of science, which considers predictive accuracy to be a primary goal of science."]
Paul Franceschi, "A Solution to Goodman's Paradox," Dialogue 40 (2001) [online]
Vinod Goel and Raymond J. Dolan, "Differential involvement of left prefrontal cortex in inductive and deductive reasoning", Cognition 93 (2004): B109--B121
Ulf Grenander, Abstract Inference
Ulf Grenander and Michael Miller, Pattern Theory: From Representation to Inference
Laszlo Gyorfi et al., A Distribution-Free Theory of Nonparametric Regression
Stephen José Hanson et al., eds., Computational Learning Theory and Natural Learning Systems
- I: Constraints and Prospects
- II: Interactions between Theory and Experiment
Petr Hajek and Martin Holena, "Formal logics of discovery and hypothesis formation by machine," Theoretical Computer Science 292 (2002): 345--357
Peter Hall and Qiwei Yao, "Approximating conditional distribution functions using dimension reduction", math.ST/0507432 = Annals of Statistics 33 (2005): 1404--1421
Gilbert H. Harman, "The Inference to the Best Explanation", The Philosophical Review 74 (1965): 88--95 [JSTOR; thanks to Kenny Easwaran for the pointer]
Patrick Heas and Mihai Datcu, "Supervised learning on graphs of spatio-temporal similarity in satellite image sequences", arxiv:0709.3013
David F. Hendry and Jurgen A. Doornik, Empirical Model Discovery and Theory Evaluation: Automatic Selection Methods in Econometrics
Jaakko Hintikka
- Socratic Epistemology: Explorations of Knowledge-Seeking by Questioning
- Inquiry as Inquiry: A Logic of Scientific Discovery
Ykä Huhtala, Juha Kärkkäinen, Pasi Porkka and Hannu Toivonen, "TANE: An Efficient Algorithm for Discovering Functional and Approximate Dependencies," The Computer Journal 42 (1999): 100--111
Eyke Hüllermeier, Willem Waegeman, "Aleatoric and Epistemic Uncertainty in Machine Learning: An Introduction to Concepts and Methods", arxiv:1910.09457
Christian Igel and Marc Toussaint, "On Classes of Functions for which No Free Lunch Results Hold," cs.NE/0108011
Lancelot F. James, David J. Marchette and Carey Priebe, "Consistent estimation of mixture complexity", Annals of Statistics 29 (2001): 1281--1296
John R. Josephson and Susan G. Josephson (eds.), Abductive Inference: Computation, Philosophy, Technology
Yuri Kalnishkan, Vladimir Vovk and Michael V. Vyugin, "How many strings are easy to predict?", Information and Computation 201 (2005): 55--71 ["It is well known in the theory of Kolmogorov complexity that most strings cannot be compressed; more precisely, only exponentially few (O(2^{n-m})) binary strings of length n can be compressed by m bits. This paper extends the 'incompressibility' property of Kolmogorov complexity to the 'unpredictability' property of predictive complexity. The 'unpredictability' property states that predictive complexity (defined as the loss suffered by a universal prediction algorithm working infinitely long) of most strings is close to a trivial upper bound (the loss suffered by a trivial minimax constant prediction strategy). We show that only exponentially few strings can be successfully predicted and find the base of the exponent."]
Michael Kearns and Dana Ron, "Algorithmic Stability and Sanity-Check Bounds for Leave-One-Out Cross-Validation," Neural Computation 11 (1999): 1427--1453
Kevin T. Kelly
- The Logic of Reliable Inquiry [Includes cartoons by the author]
Eric D. Kolaczyk and Robert D. Nowak, "Multiscale likelihood analysis and complexity penalized estimation", math.ST/0406424 = Annals of Statistics 32 (2004): 500--527
Ingo Kreuz and Dieter Roller, "Relevant Knowledge First: Reinforcement Learning and Forgetting in Knowledge Based Configuration," cs.AI/0109034
Henry E. Kyburg Jr. and Choh Man Teng, "Evaluating Defaults," cs.AI/0207083
Steffen Lange and Gunter Grieser, "Variants of iterative learning," Theoretical Computer Science 292 (2002): 359--376
Nicolas Le Roux and Yoshua Bengio, "Deep Belief Networks Are Compact Universal Approximators", Neural Computation 22 (2010): 2192--2207
F. Liang and A. Barron, "Exact Minimax Strategies for Predictive Density Estimation, Data Compression, and Model Selection", IEEE Transactions on Information Theory 50 (2004): 2708--2726
Stephen Luttrell, "Using Self-Organising Mappings to Learn the Structure of Data Manifolds", cs.NE/0406017
David J. C. MacKay, Information Theory, Inference and Learning Algorithms [Online version]
Adrian Mackenzie, Machine Learners: Archaeology of a Data Practice
Sridhar Mahadevan, Representation Discovery Using Harmonic Analysis
Gideon S. Mann and Andrew McCallum, "Generalized expectation criteria for semi-supervised learning with weakly labeled data", Journal of Machine Learning Research 11 (2010): 955--984
Heikki Mannila and Kari-Jouko Räihä, "On the complexity of inferring functional dependencies," Discrete Applied Mathematics 40 (1992): 237--243
Martin and Osherson, Elements of Scientific Inquiry [A good introduction to the theory of formal learning, especially of recursive functions in the absence of noise. Not even hand-waving that this is a sensible idealization of what scientists do.]
Conor Mayo-Wilson, Combining Causal Theories and Dividing Scientific Labor [Ph.D. thesis, CMU Philosophy Dept., 2012; thanks to Dr. Mayo-Wilson for a copy]
Geoffrey J. McLachlan, Discriminant Analysis and Statistical Pattern Recognition
Abraham Meidan and Boris Levin, "Choosing from Competing Theories in Computerised Learning", Minds and Machines 12 (2002): 119--129
I. J. Myung, Vijay Balasubramanian and M. A. Pitt, "Counting probability distributions: Differential geometry and model selection", Proceedings of the National Academy of Sciences (USA) 97 (2000): 11170--11175
National Research Council, Massive Data Sets [Online]
O. Nelles, Nonlinear System Identification
Ilya Nemenman, "Fluctuation-Dissipation Theorem and Models of Learning", Neural Computation 17 (2005): 2006--2033 ["We analyze how various abstract Bayesian learners perform on different data and argue that it is difficult to determine which learning-theoretic computation is performed by a particular organism using just its performance in learning a stationary target (learning curve). Based on the fluctuation-dissipation relation in statistical physics, we then discuss a different experimental setup that might be able to solve the problem."]
Kamal Nigam and Rayid Ghani, "Analyzing the Effectiveness and Applicability of Co-training", CIKM 2000, pp. 86--93
Liam Paninski, "Asymptotic Theory of Information-Theoretic Experimental Design", Neural Computation 17 (2005): 1480--1507
Hanchuan Peng, Fuhui Long and Chris Ding, "Feature Selection Based on Mutual Information: Criteria of Max-Dependency, Max-Relevance, and Min-Redundancy", IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (2005): 1226--1238 [This sounds like an idea I had in 2002, and was too dumb/lazy to follow up on.]
Leonid Peshkin, Kee-Eung Kim, Nicolas Meuleau and Leslie Pack Kaelbling, "Learning to Cooperate via Policy Search," cs.LG/0105032
Leonid Peshkin and Christian R. Shelton, "Learning from Scarce Experience," cs.AI/0204043
Karl Pfleger
- On-Line Learning of Undirected Sparse n-grams
- Learning Predictive Compositional Hierarchies [PS.gz]
Fenna H. Poletiek, Hypothesis Testing Behaviour [Review by Denny Borsboom]
Joel B. Predd, Sanjeev R. Kulkarni and H. Vincent Poor
- "Consistency in Models for Distributed Learning under Communication Constraints", cs.IT/0503071
- "Distributed Learning in Wireless Sensor Networks", cs.IT/0503072
Detlef Prescher, "A Tutorial on the Expectation-Maximization Algorithm Including Maximum-Likelihood Estimation and EM Training of Probabilistic Context-Free Grammars", cs.CL/0412015
Vasin Punyakanok and Dan Roth, "The Use of Classifiers in Sequential Inference," cs.LG/0111003
Joaquin Quinonero-Candela, Masashi Sugiyama, Anton Schwaighofer and Neil D. Lawrence (eds.), Dataset Shift in Machine Learning
Maxim Raginsky, "A complexity-regularized quantization approach to nonlinear dimensionality reduction", cs.IT/0501091
Magnus Rattray, "Stochastic trapping in a solvable model of on-line independent component analysis," cond-mat/0105057
Suman Ravuri, Mélanie Rey, Shakir Mohamed, Marc Deisenroth, "Understanding Deep Generative Models with Generalized Empirical Likelihoods", arxiv:2306.09780
Salah Rifai, Yoshua Bengio, Yann Dauphin, Pascal Vincent, "A Generative Process for Sampling Contractive Auto-Encoders", ICML 2012, arxiv:1206.6434
Lorenzo Rosasco, Mikhail Belkin, Ernesto De Vito, "On Learning with Integral Operators", Journal of Machine Learning Research 11 (2010): 905--934
Dan Roth, "Learning in Natural Language: Theory and Algorithmic Approaches" [online]
Hichem Sahbi and Donald Geman, "A Hierarchy of Support Vector Machines for Pattern Detection", Journal of Machine Learning Research 7 (2006): 2087--2123
Erik Sandewall, Features and Fluents: The Representation of Knowledge about Dynamical Systems
Gerhard Schurz
- "Meta-Induction and the Prediction Game: A New View On Hume's Problem" [PDF preprint]
- "Patterns of Abduction" [PDF preprint]
Alcino J. Silva, Anthony Landreth, and John Bickle, Engineering the Next Revolution in Neuroscience: The New Science of Experiment Planning
Aris Spanos
- "Statistical Induction, Severe Testing, and Model Validation" [Preprint]
- "Revisiting data mining: `hunting' with or without a license", Journal of Economic Methodology 7 (2000): 231--264 [PDF reprint]
Peter Sollich and Anason Halees, "Learning curves for Gaussian process regression: Approximations and bounds," cond-mat/0105015
Ray Solomonoff's Papers
Sonnenberg et al., "The SHOGUN Machine Learning Toolbox", Journal of Machine Learning Research 11 (2010): 1799--1802
Eduardo D Sontag, "Adaptation Implies Internal Model," math.OC/0203228
Susanne Still, "Information theoretic approach to interactive learning", arxiv:0709.1948
Ron Sun and C. L. Giles (eds.), Sequence Learning: Paradigms, Algorithms, and Applications
Suvrit Sra, Sebastian Nowozin and Stephen J. Wright (eds.), Optimization for Machine Learning
Sebastian Thrun and Lorien Pratt (eds.), Learning to Learn
Robert Tibshirani and Larry Wasserman, "Correlation-sharing for detection of differential gene expression", math.ST/0608061 ["Our proposal averages the univariate scores of each feature with the scores in correlation neighborhoods. ... The general idea of correlation-sharing can be applied to other prediction problems involving a large number of correlated features."]
Nicholas B. Turk-Browne, Brian J. Scholl, Marvin M. Chun, and Marcia K. Johnson, "Neural Evidence of Statistical Learning: Efficient Detection of Visual Regularities Without Awareness", Journal of Cognitive Neuroscience 21 (2009): 1934--1945
Richard Turner, Maneesh Sahani, "A Maximum-Likelihood Interpretation for Slow Feature Analysis", Neural Computation 19 (2007): 1022--1038
Peter D. Turney, "How to shift bias: Lessons from the Baldwin effect," Evolutionary Computation 4 (1996): 271--295 [online]
Satoshi Watanabe, Knowing and Guessing: A Quantitative Study of Inference and Information
Ying Yang, Xindong Wu and Xingquan Zhu, "Mining in Anticipation for Concept Change: Proactive-Reactive Prediction in Data Streams", Data Mining and Knowledge Discovery 13 (2006): 261--289
H. Zha, X. He, C. Ding, M. Gu and H. Simon, "Bipartite Graph Partitioning and Data Clustering," cs.IR/0108018

~~To be shot after a fair trial~~

Menachem Stern, Arvind Murugan, "Learning without neurons in physical systems", arxiv:2206.05831

CRS, Causal Architecture and Model Discovery: Theory, Algorithms and Examples
CRS, "Three Kinds of Complexity in Prediction: Induction, Estimation and Calculation"