## Machine Learning, Statistical Inference and Induction

*01 Jun 2018 10:41*

There's a place where AI, statistics and epistemology-methodology converge, or want to anyhow. "Machine learning" is the AI label: how do we make a machine that can find and learn the regularities in a data set? (If the data set is really, really big, and we care mostly about making practically valuable predictions, this becomes data mining, or "knowledge discovery in databases," KDD.) The statisticians ask very similar questions about model-fitting and hypothesis-testing. The epistemologists are mired in the problem of induction, and "inference to the best explanation" (a phrase, I am told by Kenny Easwaran, coined by Gilbert Harman; link below). The fields overlap in the most crazy-quilt and arbitrary way: I've heard university librarians arguing over whether specific books should go to the engineering or the philosophy library, for instance.

The connection to neuroscience and cognitive science is plain: how on Earth do
human beings, and other critters, actually learn? Given that there are many
different strategies, which ones do organisms use, and why, and are they good
ones? (It's entirely possible that we've gotten locked in to inefficient
learning strategies; then the question becomes whether or not they can be
improved.) Studying learning by organisms lets us test theories of
learning-in-the-abstract, and vice versa: if we had, say, a good proof that a
certain learning scheme simply would not work, we'd *know* that animals
don't use it.

One fairly strong result seems to be that *tabulae rasae* don't work:
you've got to give the machine/baby/scientist *some* hints, or restrict
the field of possible hypotheses initially, or you'll never get anywhere. This
was at least implicit in Hume, and I believe the other
classical empiricists as well, but they don't seem to have been restrictive
*enough* to account for the way we actually do learn. Natural selection is the obvious candidate for
having restricted our hypothesis-set, and for having designed our learning
mechanisms.
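
A toy illustration of the blank-slate point, as a minimal sketch under assumptions of my own: the inputs are 3-bit vectors, the "true" regularity is simply the first bit, and the restricted hypothesis class ("copy or negate one coordinate") is made up for the example. With every Boolean function allowed, the hypotheses which fit the training data split evenly on each unseen input, so the data say nothing about new cases; with the restricted class, the same four examples pin down the answer.

```python
from itertools import product

# All 8 possible inputs: 3-bit vectors.
POINTS = list(product([0, 1], repeat=3))
# Four observed inputs (an arbitrary choice for this sketch); the rest are unseen.
TRAIN = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
TEST = [p for p in POINTS if p not in TRAIN]


def target(x):
    """Hypothetical 'true' regularity: the label is the first bit."""
    return x[0]


def consistent(hypotheses):
    """Keep only the hypotheses that agree with the target on every training point."""
    return [h for h in hypotheses if all(h[p] == target(p) for p in TRAIN)]


# Blank slate: every function from the 8 inputs to {0, 1}, stored as a dict.
blank_slate = [dict(zip(POINTS, outputs))
               for outputs in product([0, 1], repeat=len(POINTS))]

# Restricted class: copy one coordinate, or negate it (6 hypotheses in all).
restricted = [{p: p[i] ^ flip for p in POINTS}
              for i in range(3) for flip in (0, 1)]


def votes_for_one(hypotheses, point):
    """How many of the given hypotheses predict label 1 at this point."""
    return sum(h[point] for h in hypotheses)


blank_fit = consistent(blank_slate)
restricted_fit = consistent(restricted)

for p in TEST:
    print(p,
          "blank slate:", votes_for_one(blank_fit, p), "of", len(blank_fit),
          "| restricted:", votes_for_one(restricted_fit, p), "of", len(restricted_fit))
```

Nothing here depends on the particular training points beyond their being enough to single out one hypothesis in the restricted class; the even split in the unrestricted case holds for any choice of training set.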

My positivist temperament can hardly help being pleased by this "attempt to introduce the experimental method of reasoning into moral subjects," which, as data mining, has massive industrial applications. My real interest in this isn't, for once, philosophical. Instead, I want to be able to quantify, or at the very least characterize, self-organization, which means I need a good way of automatically finding patterns or regularities in data-sets. For someone who's got the computational mechanics gospel, this means "inferring statistical complexity," and that means the automated construction of abstract-machine or formal-language models of data-sets. (Alternately: Figuring out how natural things compute.) And doing that well means addressing all the issues people in these areas address, so I figure I ought to just steal from them.

See also: Causality; collective cognition; clustering; ensemble methods; grammatical inference; graphical models; learning in games; learning theory; the minimum description length principle; model selection; neural nets; Occam's razor; scientific thinking; sequential decision-making; statistics with structured data; time series; and universal prediction algorithms now get their own notebooks; other topics also need to be spun off from this one.

- Recommended, big picture:
- Leo Breiman, "Statistical Modeling: The Two Cultures", Statistical Science **16** (2001): 199--231 [Very much including the discussion by others and the reply by Breiman. Thanks to Chris Wiggins for alerting me to this.]
- Nicolo Cesa-Bianchi and Gabor Lugosi, Prediction, Learning, and Games [Mini-review]
- Ulf Grenander, Elements of Pattern Theory
- David Hand, Heikki Mannila and Padhraic Smyth, Principles of Data Mining
- Trevor Hastie, Robert Tibshirani and Jerome Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction [Website, with full text free in PDF]
- John H. Holland, Keith J. Holyoak, Richard E. Nisbett, and Paul R. Thagard, Induction: Processes of Inference, Learning, and Discovery [Review: The Best-Laid Schemes o' Mice an' Men]
- Michael J. Kearns and Umesh V. Vazirani, An Introduction to Computational Learning Theory [Review: How to Build a Better Guesser]
- Deborah G. Mayo, Error and the Growth of Experimental Knowledge [How to use standard statistical tests to learn from experiment, without Bayesian priors or other a priori folderol. Review: We Have Ways of Making You Talk, or, Long Live Peircism-Popperism-Neyman-Pearson Thought!]
- Deborah G. Mayo and D. R. Cox, "Frequentist statistics as a theory of inductive inference", math.ST/0610846
- John Norton, "A Material Theory of Induction", Philosophy of Science **70** (2003): 647--670 [PDF reprint]
- Jorma Rissanen, Stochastic Complexity in Statistical Inquiry [Review: Less Is More, or, *Ecce data!*]
- Sara J. Shettleworth, Cognition, Evolution and Behavior
- Peter Spirtes, Clark Glymour and Richard Scheines, Causation, Prediction, and Search
- Chris Thornton, Truth from Trash: How Learning Makes Sense [Well, half a recommendation. Review: Two Cheers for Trash]
- V. N. (=Vladimir Naumovich) Vapnik, The Nature of Statistical Learning Theory [Review: A Useful Biased Estimator]
- H. Peyton Young, Individual Strategy and Social Structure [Pretty dumb agents nonetheless able to learn in a basic sense, and what they can accomplish in the way of societies. Review: A Myopic (and Sometimes Blind) Eye on the Main Chance, or, the Origins of Custom]

- Recommended, close-ups:
- Shun-ichi Amari, "Information Geometry on Hierarchical Decomposition of Stochastic Interactions," IEEE Transactions on Information Theory **47** (2001): 1701--1711 [A way of finding "parts" in complex distributions; uses many differential geometry tricks to do statistics. PDF reprint]
- Massimiliano Badino, "An Application of Information Theory to the Problem of the Scientific Experiment", Synthese **140** (2004): 355--389 [MS Word preprint. See comments under Information Theory.]
- David Balduzzi
- "Information, learning and falsification", arxiv:1110.3592
- "Falsification and future performance", arxiv:1111.5648

- H. B. Barlow, "Unsupervised Learning", Neural Computation **1** (1989): 295--311
- Jonathan Baxter, "A Model of Inductive Bias Learning," Journal of Artificial Intelligence Research **12** (2000): 149--198 [How to learn what class of hypotheses you should be trying to use, i.e., your inductive bias. Assumes independence, again.]
- Mikhail Belkin, Partha Niyogi, Vikas Sindhwani, "Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples", Journal of Machine Learning Research **7** (2006): 2399--2434
- William Bialek, Ilya Nemenman, and Naftali Tishby, "Predictability, Complexity and Learning," physics/0007070
- Ken Binmore, "Making Decisions in Large Worlds" ["This paper argues that we need to look beyond Bayesian decision theory for an answer to the general problem of making rational decisions under uncertainty." PDF manuscript; thanks to Nicolas Della Penna for the pointer]
- Margaret Boden, The Creative Mind: Myths and Mechanisms [How and when to change the kind of representation you're using, a topic shamefully neglected in the literature. Precis]
- Josh Bongard and Hod Lipson, "Automated reverse engineering of nonlinear dynamical systems", Proceedings of the National Academy of Sciences (USA) **104** (2007): 9943--9948 [Thanks to Chris Weed for pointing me to this. Interesting, but basically unaware of the literature on state-space reconstruction in nonlinear dynamics.]
- R. B. Braithwaite, Scientific Explanation
- Arthur W. Burks, "Peirce's Theory of Abduction", Philosophy of Science **13** (1946): 301--306 [JSTOR; ungated copy]
- Venkat Chandrasekaran and Michael I. Jordan, "Computational and Statistical Tradeoffs via Convex Relaxation", Proceedings of the National Academy of Sciences (USA) **110** (2013): E1181--E1190, arxiv:1211.1073
- Pedro Domingos
- "The Role of Occam's Razor in Knowledge Discovery," Data Mining and Knowledge Discovery **3** (1999) [Online]
- "A Few Useful Things to Know about Machine Learning" [PDF preprint]

- Marco Dorigo and Marco Colombetti, Robot Shaping: An Experiment in Behavior Engineering [Review: Crawling Towards the Light]
- John W. Fisher III, Alexander T. Ihler and Paul A. Viola, "Learning Informative Statistics: A Nonparametric Approach", pp. 900--906 in NIPS 12 (1999) [PDF reprint. I'd call this more of a semi-parametric approach than a fully non-parametric one; they assume a parametric form for the dependence structure, but are agnostic about the distributions of innovations, and so try to maximize non-parametrically estimated mutual informations.]
- Francois Fleuret and Donald Geman, "Stationary Features and Cat Detection", Journal of Machine Learning Research **9** (2008): 2549--2578
- Peter Godfrey-Smith, "Inductions, Samples, and Kinds" [PDF preprint]
- David J. Hand, "Classifier Technology and the Illusion of Progress", Statistical Science **21** (2006): 1--15, math.ST/0606441 [Or: don't believe everything you read in ICML! With commentary, available from the arxiv.org link]
- Hinton and Sejnowski (eds.), Unsupervised Learning [A sort of "Neural Computation's Greatest Hits" compilation]
- Tommi S. Jaakkola and David Haussler, "Exploiting generative models in discriminative classifiers", NIPS 11 (1998) [PDF]
- Aleks Jakulin and Ivan Bratko, "Quantifying and Visualizing Attribute Interactions", cs.AI/0308002
- Kevin T. Kelly [Kelly's work on Occam's Razor is, so far as I know, the only justification for it which doesn't either massively beg the question, change the subject, or make massive assumptions about the nature of the world, Divine Providence, etc.]
- "A New Solution to the Puzzle of Simplicity", phil-sci/2984 [One of his clearest papers]
- Ockham Project Web Page

- Shane Legg, "Is There an Elegant Universal Theory of Prediction?", cs.AI/0606070 [A nice set of diagonalization arguments against the hope of a universal prediction scheme which has the nice features of Solomonoff-style induction, but is actually computable.]
- Jerzy Neyman, First Course in Probability and Statistics [Fine explanation of his ideas about "rules of inductive behavior" --- which probably isn't very good methodology, but has the makings of excellent robotics]
- Leonid Peshkin, "Structure induction by lossless graph compression", cs.DS/0703132 [Adapting data-compression ideas to discover hierarchical structures in graphs, e.g., the 4 bases from a tinker-toy model of DNA.]
- Ali Rahimi and Benjamin Recht, "Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning", NIPS 2008
- Rajat Raina, Alexis Battle, Honglak Lee, Benjamin Packer and Andrew Y. Ng, "Self-taught learning: Transfer learning from unlabeled data", ICML 2007 [PDF. This is a clever idea for semi-supervised learning. Given a big supply of unlabeled examples, and a small number of labeled examples, use the unlabeled ones to learn a high-level/abstract representation or set of features. Then use *those* features in straightforward classifier learning on the labeled data. (They have a specific idea for learning the higher-level representation, by basis selection, but that's a separable issue.)]
- Robert E. Schapire and Yoav Freund, Boosting: Foundations and Algorithms [Review: Weak Learners of the World, Unite!]
- Gerhard Schurz, "Universal vs. Local Prediction Strategies: A Game-Theoretical Approach to the Problem of Induction", phil-sci/3720 [Slides only?!?]
- John Shawe-Taylor and Nello Cristianini, Kernel Methods for Pattern Analysis
- Spyros Skouras, "Decisionmetrics: Towards a Decision-Based Approach to Econometrics" [Suppose what you really want to do with your model is to make decisions, e.g., to buy and sell and make money doing so. Then fitting the model to minimize a standard error measure, e.g., mean square error, often gives worse performance than fitting the model to minimize expected losses. This applies much more broadly than Spyros's financial examples may suggest.]
- Aris Spanos, "The Curve-Fitting Problem, Akaike-type Model Selection, and the Error Statistical Approach" [PDF preprint]
- Sara van de Geer, Applications of Empirical Process Theory [A.k.a. Empirical Process Theory in M-Estimation]
- Greg Ver Steeg, Aram Galstyan, "Discovering Structure in High-Dimensional Data Through Correlation Explanation", arxiv:1406.1222
- Vladimir Vovk, Alex Gammerman and Glenn Shafer, Algorithmic Learning in a Random World [Mini-review]
- Blaz Zupan, Marko Bohanec, Janez Demsar and Ivan Bratko, "Learning by discovering concept hierarchies", Artificial Intelligence **109** (1999): 211--242 [Thanks to Aleks Jakulin for letting me know about this. PDF preprint]

- Not exactly recommended:
- Dana Ballard, An Introduction to Natural Computation [Review: Not Natural Enough]
- Jacob Feldman, "How surprising is a simple pattern? Quantifying 'Eureka!'," Cognition **93** (2004): 199--224 [Claims to (a) have a psychologically valid measure of *subjective* complexity, and (b) derive a null distribution for it. But the evidence that his particular complexity measure captures what people do in concept-learning problems is deferred to other papers.]
- Gilbert Harman and Sanjeev Kulkarni, Reliable Reasoning: Induction and Statistical Learning Theory [Published by MIT Press; 2006 draft free online via Prof. Kulkarni (about 100 pages). The technical material on learning theory is mostly alright, so far as it goes, but the *philosophy* is irritatingly lack-luster. Definitely not worth paying what the publisher charges for it. There is now a good review by Kevin Kelly and Conor Mayo-Wilson.]

- Modesty forbids me to recommend:
- CRS, Causal Architecture, Complexity and Self-Organization in Time Series and Cellular Automata [Ph.D. thesis, UW-Madison, 2001]
- CRS, "Dynamics of Bayesian Updating with Dependent Data and Mis-specified Models", arxiv:0901.1342 = Electronic Journal of Statistics **3** (2009): 1039--1074
- CRS and Kristina Lisa Klinkner, "Blind Construction of Optimal Nonlinear Recursive Predictors for Discrete Sequences", pp. 504--511 in UAI 2004, cs.LG/0406011

- To read:
- Steven Abney, "Bootstrapping", ACL 2002, pp. 360--367 [In the sense of "a problem setting in which one is given a small set of labeled data and a large set of unlabeled data, and the task is to induce a classifier", not the famous statistical procedure]
- Tatsuya Akutsu, Satoru Miyano and Satoru Kuhara, "A simple greedy algorithm for finding functional relations: efficient implementation and average case analysis," Theoretical Computer Science **292** (2002): 481--495
- Atocha Aliseda, Abductive Reasoning: Logical Investigations into Discovery and Explanation
- Andris Ambainis, "Probabilistic inductive inference: a survey", cs.LG/9902026 [Taking "inductive inference" exclusively in the sense of learning recursive functions]
- Rosa I. Arriaga and Santosh Vempala, "An algorithmic theory of learning: Robust concepts and random projection", Machine Learning **63** (2006): 161--182
- Nihat Ay
- "Locality of global stochastic interaction in directed acyclic networks," preprint, MPI-MIS 54/2001
- "An information geometric approach to a theory of pragmatic structuring," MPI-MIS 52/2000

- Vijay Balasubramanian, "Statistical Inference, Occam's Razor, and Statistical Mechanics on the Space of Probability Distributions", Neural Computation **9** (1997): 349--368, arxiv:cond-mat/9601030
- Pierre Baldi et al., Modeling the Internet and the Web: Probabilistic Methods and Algorithms
- Jayanta Basak, "Online Adaptive Decision Trees", Neural Computation **16** (2004): 1959--1981
- William Bechtel and Robert C. Richardson, Discovering Complexity: Decomposition and Localization as Strategies in Scientific Research
- Sergey V. Beiden, Marcus A. Maloof and Robert F. Wagner, "A General Model for Finite-Sample Effects in Training and Testing of Competing Classifiers", IEEE Transactions on Pattern Analysis and Machine Intelligence **25** (2003): 1561--1569
- Ron Bekkerman, Mikhail Bilenko and John Langford (eds.), Scaling up Machine Learning: Parallel and Distributed Approaches
- D. Paul Benjamin (ed.), Change of Representation and Inductive Bias
- James Blachowicz, Of Two Minds: The Nature of Inquiry [From the back cover: "The logic of *correction* developed here directly opposes the claim made by evolutionary epistemologists such as Popper and Campbell that there is no such thing as a 'logical method for having new ideas.' ... This comprehensive and revolutionary theory challenges traditional epistemology's conception of justification and provides substantial new interpretations of the nature of ampliative inference, representation and meaning, Platonic and Hegelian dialectic, Kantian analysis, the heuristic function of models and metaphors, and the role of inquiry in the constitution of human consciousness." All this in only four hundred pages! But the stuff on a logic of correction is very important --- if correct.]
- Gilles Blanchard and Donald Geman, "Hierarchical testing designs for pattern recognition", math.ST/0507421 = Annals of Statistics **33** (2005): 1155--1202
- Hendrik Blockeel and Jan Struyf, "Efficient algorithms for decision tree cross-validation," cs.LG/0110036
- Avrim Blum and Tom Mitchell, "Combining Labeled and Unlabeled Data with Co-Training", COLT 98, pp. 92--100
- Avrim Blum, Adam Kalai and Hal Wasserman, "Noise-Tolerant Learning, the Parity Problem, and the Statistical Query Model," cs.LG/0010022
- Leo Breiman, "Prediction Games and Arcing Algorithms," Neural Computation **11** (1999): 1493--1517
- Robert Alan Brown, Machines that Learn: Based on the Principle of Empirical Control
- Christopher J. C. Burges, "Dimension Reduction: A Guided Tour", Foundations and Trends in Machine Learning **2:4** (2010) [Preprint version]
- Meir Buzaglo, The Logic of Concept Expansion
- Adam Cannon, J. Mark Ettinger, Don Hush, and Clint Scovel, "Machine Learning with Data Dependent Hypothesis Classes," Journal of Machine Learning Research **2** (2002): 335--358
- Philip Ellery Catton, "The Justification(s) of Induction(s)," online
- Tommy W. S. Chow and D. Huang, "Estimating Optimal Feature Subsets Using Efficient Estimation of High-Dimensional Mutual Information", IEEE Transactions on Neural Networks **16** (2005): 213--224
- Andy Clark and Chris Thornton, "Trading Spaces: Computation, Representation and the Limits of Uninformed Learning," Behavioral and Brain Sciences **20** (1997): 57--90 [Draft]
- Bertrand Clarke, "Desiderata for a Predictive Theory of Statistics", Bayesian Analysis **5** (2010): 1--36
- David Corfield, "Varieties of Justification in Machine Learning", Minds and Machines **20** (2010): 291--301
- Toby S. Cubitt, Jens Eisert, Michael M. Wolf, "Extracting dynamical equations from experimental data is NP-hard", Physical Review Letters **108** (2012): 120503, arxiv:1005.0005
- Mark Culp, George Michailidis and Kjell Johnson, "On multi-view learning with additive models", Annals of Applied Statistics **3** (2009): 292--318 = arxiv:0906.1117
- Marco Cuturi and Kenji Fukumizu, "Multiresolution Kernels", cs.LG/0507033
- H. Daume III, D. Marcu, "Domain Adaptation for Statistical Classifiers", arxiv:1109.6341
- Peter Dayan, "Recurrent Sampling Models for the Helmholtz Machine," Neural Computation **11** (1999): 653--677
- Carlos R. de la Mora B., Carlos Gershenson and Angelica Garcia-Vega, "The role of behavior modifiers in representation development", cs.AI/0403006
- Luc Devroye et al., A Probabilistic Theory of Pattern Recognition
- Thomas G. Dietterich, "Machine Learning for Sequential Data" [PDF. Thanks to Gustavo Lacerda for a pointer.]
- Nicola Di Mauro, Teresa M.A. Basile, Stefano Ferilli, Floriana Esposito, "Feature Construction for Relational Sequence Learning", arxiv:1006.5188
- Pedro Domingos [All from his web-site]
- A General Method for Scaling Up Machine Learning Algorithms and its Application to Clustering
- Mining High-Speed Data Streams
- Mining Time-Changing Data Streams

- Dowe, Korb and Oliver (eds.), Information, Statistics and Induction in Science
- Deniz Erdogmus, Kenneth E. Hild, II, Yadunandana N. Rao and José C. Príncipe, "Minimax Mutual Information Approach for Independent Component Analysis", Neural Computation **16** (2004): 1235--1252
- Oleg V. Favorov and Dan Ryder, "SINBAD: A neocortical mechanism for discovering environmental variables and regularities hidden in sensory input", Biological Cybernetics **90** (2004): 191--202
- Aidan Feeney and Evan Heit (eds.), Inductive Reasoning: Experimental, Developmental, and Computational Approaches
- David Finton, "When Do Differences Matter? On-Line Feature Extraction Through Cognitive Economy", cs.LG/0404032 = Cognitive Systems Research **6** (2005): 263--281
- Gary William Flake, "The Calculus of Jacobian Adaptation"
- Francois Fleuret and Eric Brunet, "DEA: An Architecture for Goal Planning and Classification," Neural Computation **12** (2000): 1987--2008
- Flocchini *et al.* (eds.), Structure, Information and Communication Complexity
- Malcolm R. Forster, "How do Simple Rules 'Fit to Reality' in a Complex World?", Minds and Machines **9** (1999): 543--564 [A take on the Gigerenzer et al. idea of fast and frugal heuristics, especially their ecological adaptation to the environment. "The main purpose of this article is to apply these ideas to learning rules --- methods for constructing, selecting or evaluating competing hypotheses in science, and to the methodology of machine learning... The bad news is that ecological validity is particularly difficult to implement and difficult to understand. The good news is that it builds an important bridge from normative psychology and machine learning to recent work in the philosophy of science, which considers predictive accuracy to be a primary goal of science."]
- Paul Franceschi, "A Solution to Goodman's Paradox," Dialogue **40** (2001) [online]
- Vinod Goel and Raymond J. Dolan, "Differential involvement of left prefrontal cortex in inductive and deductive reasoning", Cognition **93** (2004): B109--B121
- Ulf Grenander, Abstract Inference
- Ulf Grenander and Michael Miller, Pattern Theory: From Representation to Inference
- Laszlo Gyorfi et al., A Distribution-Free Theory of Nonparametric Regression
- Stephen José Hanson et al., eds., Computational Learning Theory and Natural Learning Systems
- I: Constraints and Prospects
- II: Interactions between Theory and Experiment

- Petr Hajek and Martin Holena, "Formal logics of discovery and hypothesis formation by machine," Theoretical Computer Science **292** (2002): 345--357
- Peter Hall and Qiwei Yao, "Approximating conditional distribution functions using dimension reduction", math.ST/0507432 = Annals of Statistics **33** (2005): 1404--1421
- Gilbert H. Harman, "The Inference to the Best Explanation", The Philosophical Review **74** (1965): 88--95 [JSTOR; thanks to Kenny Easwaran for the pointer]
- Patrick Heas and Mihai Datcu, "Supervised learning on graphs of spatio-temporal similarity in satellite image sequences", arxiv:0709.3013
- David F. Hendry and Jurgen A. Doornik, Empirical Model Discovery and Theory Evaluation: Automatic Selection Methods in Econometrics
- Jaakko Hintikka
- Socratic Epistemology: Explorations of Knowledge-Seeking by Questioning
- Inquiry as Inquiry: A Logic of Scientific Discovery

- Ykä Huhtala, Juha Kärkkäinen, Pasi Porkka and Hannu Toivonen, "TANE: An Efficient Algorithm for Discovering Functional and Approximate Dependencies," The Computer Journal **42** (1999): 100--111
- Christian Igel and Marc Toussaint, "On Classes of Functions for which No Free Lunch Results Hold," cs.NE/0108011
- Lancelot F. James, David J. Marchette and Carey Priebe, "Consistent estimation of mixture complexity", Annals of Statistics **29** (2001): 1281--1296
- John R. Josephson and Susan G. Josephson (eds.), Abductive Inference: Computation, Philosophy, Technology
- Yuri Kalnishkan, Vladimir Vovk and Michael V. Vyugin, "How many strings are easy to predict?", Information and Computation **201** (2005): 55--71 ["It is well known in the theory of Kolmogorov complexity that most strings cannot be compressed; more precisely, only exponentially few (O(2^{n-m})) binary strings of length n can be compressed by m bits. This paper extends the 'incompressibility' property of Kolmogorov complexity to the 'unpredictability' property of predictive complexity. The 'unpredictability' property states that predictive complexity (defined as the loss suffered by a universal prediction algorithm working infinitely long) of most strings is close to a trivial upper bound (the loss suffered by a trivial minimax constant prediction strategy). We show that only exponentially few strings can be successfully predicted and find the base of the exponent."]
- Michael Kearns and Dana Ron, "Algorithmic Stability and Sanity-Check Bounds for Leave-One-Out Cross-Validation," Neural Computation **11** (1999): 1427--1453
- Kevin T. Kelly
- The Logic of Reliable Inquiry [Includes cartoons by the author]

- Eric D. Kolaczyk and Robert D. Nowak, "Multiscale likelihood analysis and complexity penalized estimation", math.ST/0406424 = Annals of Statistics **32** (2004): 500--527
- Ingo Kreuz and Dieter Roller, "Relevant Knowledge First: Reinforcement Learning and Forgetting in Knowledge Based Configuration," cs.AI/0109034
- Henry E. Kyburg Jr. and Choh Man Teng, "Evaluating Defaults," cs.AI/0207083
- Steffen Lange and Gunter Grieser, "Variants of iterative learning," Theoretical Computer Science **292** (2002): 359--376
- Nicolas Le Roux and Yoshua Bengio, "Deep Belief Networks Are Compact Universal Approximators", Neural Computation **22** (2010): 2192--2207
- F. Liang and A. Barron, "Exact Minimax Strategies for Predictive Density Estimation, Data Compression, and Model Selection", IEEE Transactions on Information Theory **50** (2004): 2708--2726
- Stephen Luttrell, "Using Self-Organising Mappings to Learn the Structure of Data Manifolds", cs.NE/0406017
- David J. C. MacKay, Information Theory, Inference and Learning Algorithms [Online version]
- Adrian Mackenzie, Machine Learners: Archaeology of a Data Practice
- Sridhar Mahadevan, Representation Discovery Using Harmonic Analysis
- Gideon S. Mann and Andrew McCallum, "Generalized expectation criteria for semi-supervised learning with weakly labeled data", Journal of Machine Learning Research **11** (2010): 955--984
- Heikki Mannila and Kari-Jouko Räihä, "On the complexity of inferring functional dependencies," Discrete Applied Mathematics **40** (1992): 237--243
- Martin and Osherson, Elements of Scientific Inquiry [A good introduction to the theory of formal learning, especially of recursive functions in the absence of noise. Not even hand-waving that this is a sensible idealization of what scientists do.]
- Conor Mayo-Wilson, Combining Causal Theories and Dividing Scientific Labor [Ph.D. thesis, CMU Philosophy Dept., 2012; thanks to Dr. Mayo-Wilson for a copy]
- Geoffrey J. McLachlan, Discriminant Analysis and Statistical Pattern Recognition
- Abraham Meidan and Boris Levin, "Choosing from Competing Theories in Computerised Learning", Minds and Machines **12** (2002): 119--129
- I. J. Myung, Vijay Balasubramanian and M. A. Pitt, "Counting probability distributions: Differential geometry and model selection", Proceedings of the National Academy of Sciences (USA) **97** (2000): 11170--11175
- National Research Council, Massive Data Sets [Online]
- O. Nelles, Nonlinear System Identification
- Ilya Nemenman, "Fluctuation-Dissipation Theorem and Models of Learning", Neural Computation **17** (2005): 2006--2033 ["We analyze how various abstract Bayesian learners perform on different data and argue that it is difficult to determine which learning-theoretic computation is performed by a particular organism using just its performance in learning a stationary target (learning curve). Based on the fluctuation-dissipation relation in statistical physics, we then discuss a different experimental setup that might be able to solve the problem."]
- Kamal Nigam and Rayid Ghani, "Analyzing the Effectiveness and Applicability of Co-training", CIKM 2000, pp. 86--93
- Sebastian Nowozin, "Improved Information Gain Estimates for Decision Tree Induction", arxiv:1206.4620
- Liam Paninski, "Asymptotic Theory of Information-Theoretic Experimental Design", Neural Computation **17** (2005): 1480--1507
- Hanchuan Peng, Fuhui Long and Chris Ding, "Feature Selection Based on Mutual Information: Criteria of Max-Dependency, Max-Relevance, and Min-Redundancy", IEEE Transactions on Pattern Analysis and Machine Intelligence **27** (2005): 1226--1238 [This sounds like an idea I had in 2002, and was too dumb/lazy to follow up on.]
- Leonid Peshkin, Kee-Eung Kim, Nicolas Meuleau and Leslie Pack Kaelbling, "Learning to Cooperate via Policy Search," cs.LG/0105032
- Leonid Peshkin and Christian R. Shelton, "Learning from Scarce Experience," cs.AI/0204043
- Karl Pfleger
- On-Line Learning of Undirected Sparse n-grams
- Learning Predictive Compositional Hierarchies [PS.gz]

- Fenna H. Poletiek, Hypothesis Testing Behaviour [Review by Denny Borsboom]
- Joel B. Predd, Sanjeev R. Kulkarni and H. Vincent Poor
- "Consistency in Models for Distributed Learning under Communication Constraints", cs.IT/0503071
- "Distributed Learning in Wireless Sensor Networks", cs.IT/0503072

- Detlef Prescher, "A Tutorial on the Expectation-Maximization Algorithm Including Maximum-Likelihood Estimation and EM Training of Probabilistic Context-Free Grammars", cs.CL/0412015
- Vasin Punyakanok and Dan Roth, "The Use of Classifiers in Sequential Inference," cs.LG/0111003
- Joaquin Quinonero-Candela, Masashi Sugiyama, Anton Schwaighofer and Neil D. Lawrence (eds.), Dataset Shift in Machine Learning
- Maxim Raginsky, "A complexity-regularized quantization approach to nonlinear dimensionality reduction", cs.IT/0501091
- Magnus Rattray, "Stochastic trapping in a solvable model of on-line independent component analysis," cond-mat/0105057
- Salah Rifai, Yoshua Bengio, Yann Dauphin, Pascal Vincent, "A Generative Process for Sampling Contractive Auto-Encoders", ICML 2012, arxiv:1206.6434
- Lorenzo Rosasco, Mikhail Belkin, Ernesto De Vito, "On Learning with Integral Operators", Journal of Machine Learning Research **11** (2010): 905--934
- Dan Roth, "Learning in Natural Language: Theory and Algorithmic Approaches" [online]
- Hichem Sahbi and Donald Geman, "A Hierarchy of Support Vector Machines for Pattern Detection", Journal of Machine Learning Research **7** (2006): 2087--2123
- Erik Sandewall, Features and Fluents: The Representation of Knowledge about Dynamical Systems
- Gerhard Schurz
- "Meta-Induction and the Prediction Game: A New View On Hume's Problem" [PDF preprint]
- "Patterns of Abduction" [PDF preprint]

- Alcino J. Silva, Anthony Landreth, and John Bickle, Engineering the Next Revolution in Neuroscience: The New Science of Experiment Planning
- Aris Spanos
- "Statistical Induction, Severe Testing, and Model Validation" [Preprint]
- "Revisiting data mining: `hunting' with or without a license", Journal of Economic Methodology **7** (2000): 231--264 [PDF reprint]

- Peter Sollich and Anason Halees, "Learning curves for Gaussian process regression: Approximations and bounds," cond-mat/0105015
- Ray Solomonoff's Papers
- Sonnenburg et al., "The SHOGUN Machine Learning Toolbox", Journal of Machine Learning Research **11** (2010): 1799--1802
- Eduardo D Sontag, "Adaptation Implies Internal Model," math.OC/0203228
- Daria Sorokina, Rich Caruana and Mirek Riedewald, "Additive Groves of Regression Trees", ECML 2007 [PDF]
- Daria Sorokina, Rich Caruana, Mirek Riedewald and Daniel Fink, "Detecting Statistical Interactions with Additive Groves of Trees", ICML 2008 [PDF]
- Susanne Still, "Information theoretic approach to interactive learning", arxiv:0709.1948
- Ron Sun and C. L. Giles (eds.), Sequence Learning: Paradigms, Algorithms, and Applications
- Suvrit Sra, Sebastian Nowozin and Stephen J. Wright (eds.), Optimization for Machine Learning
- Eiji Takimoto and Akira Maruoka, "Top-down decision tree learning as information based boosting," Theoretical Computer Science **292** (2002): 447--464
- Sebastian Thrun and Lorien Pratt (eds.), Learning to Learn
- Robert Tibshirani and Larry Wasserman, "Correlation-sharing for detection of differential gene expression", math.ST/0608061 ["Our proposal averages the univariate scores of each feature with the scores in correlation neighborhoods. ... The general idea of correlation-sharing can be applied to other prediction problems involving a large number of correlated features."]
- Nicholas B. Turk-Browne, Brian J. Scholl, Marvin M. Chun, and Marcia K. Johnson, "Neural Evidence of Statistical Learning: Efficient Detection of Visual Regularities Without Awareness", Journal of Cognitive Neuroscience **21** (2009): 1934--1945
- Richard Turner, Maneesh Sahani, "A Maximum-Likelihood Interpretation for Slow Feature Analysis", Neural Computation **19** (2007): 1022--1038
- Peter D. Turney, "How to shift bias: Lessons from the Baldwin effect," Evolutionary Computation **4** (1996): 271--295 [online]
- Satoshi Watanabe, Knowing and Guessing: A Quantitative Study of Inference and Information
- Ying Yang, Xindong Wu and Xingquan Zhu, "Mining in Anticipation for Concept Change: Proactive-Reactive Prediction in Data Streams", Data Mining and Knowledge Discovery **13** (2006): 261--289
- H. Zha, X. He, C. Ding, M. Gu and H. Simon, "Bipartite Graph Partitioning and Data Clustering," cs.IR/0108018

- To write:
- CRS, Causal Architecture and Model Discovery: Theory, Algorithms and Examples
- CRS, "Three Kinds of Complexity in Prediction: Induction, Estimation and Calculation"