Notebooks

## Machine Learning, Statistical Inference and Induction

01 Jun 2018 10:41

There's a place where AI, statistics and epistemology-methodology converge, or want to anyhow. "Machine learning" is the AI label: how do we make a machine that can find and learn the regularities in a data set? (If the data set is really, really big, and we care mostly about making practically valuable predictions, this becomes data mining, or "knowledge discovery in databases," KDD.) The statisticians ask very similar questions about model-fitting and hypothesis-testing. The epistemologists are mired in the problem of induction, and "inference to the best explanation" (a phrase, I am told by Kenny Easwaran, coined by Gilbert Harman; link below). The fields overlap in the most crazy-quilt and arbitrary way: I've heard university librarians arguing over whether specific books should go to the engineering or the philosophy library, for instance.

The connection to neuroscience and cognitive science is plain: how on Earth do human beings, and other critters, actually learn? Given that there are many different strategies, which ones do organisms use, and why, and are they good ones? (It's entirely possible that we've gotten locked in to inefficient learning strategies; then the question becomes whether or not they can be improved.) Studying learning by organisms lets us test theories of learning-in-the-abstract, and vice versa: if we had, say, a good proof that a certain learning scheme simply would not work, we'd know that animals don't use it.

One fairly strong result seems to be that tabulae rasae don't work: you've got to give the machine/baby/scientist some hints, or restrict the field of possible hypotheses initially, or you'll never get anywhere. This was at least implicit in Hume, and I believe the other classical empiricists as well, but they don't seem to have been restrictive enough to account for the way we actually do learn. Natural selection is the obvious candidate for having restricted our hypothesis-set, and for having designed our learning mechanisms.

My positivist temperament can hardly help being pleased by this "attempt to introduce the experimental method of reasoning into moral subjects," which, as data mining, has massive industrial applications. My real interest in this isn't, for once, philosophical. Instead, I want to be able to quantify, or at the very least characterize, self-organization, which means I need a good way of automatically finding patterns or regularities in data-sets. For someone who's got the computational mechanics gospel, this means "inferring statistical complexity," and that means the automated construction of abstract-machine or formal-language models of data-sets. (Alternately: Figuring out how natural things compute.) And doing that well means addressing all the issues people in these areas address, so I figure I ought to just steal from them.

See also the following topics, which now get their own notebooks: causality; collective cognition; clustering; ensemble methods; grammatical inference; graphical models; learning in games; learning theory; the minimum description length principle; model selection; neural nets; Occam's razor; scientific thinking; sequential decision-making; statistics with structured data; time series; and universal prediction algorithms. Other topics also need to be spun off from this one.

Recommended, close-ups:
• Shun-ichi Amari, "Information Geometry on Hierarchical Decomposition of Stochastic Interactions," IEEE Transactions on Information Theory 47 (2001): 1701--1711 [A way of finding "parts" in complex distributions; uses many differential geometry tricks to do statistics. PDF reprint]
• Massimiliano Badino, "An Application of Information Theory to the Problem of the Scientific Experiment", Synthese 140 (2004): 355--389 [MS Word preprint. See comments under Information Theory.]
• David Balduzzi
• H. B. Barlow, "Unsupervised Learning"
• Jonathan Baxter, "A Model of Inductive Bias Learning," Journal of Artificial Intelligence Research 12 (2000): 149--198 [How to learn what class of hypotheses you should be trying to use, i.e., your inductive bias. Assumes independence, again.]
• Mikhail Belkin, Partha Niyogi, Vikas Sindhwani, "Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples", Journal of Machine Learning Research 7 (2006): 2399--2434
• William Bialek, Ilya Nemenman, and Naftali Tishby, "Predictability, Complexity and Learning," physics/0007070
• Ken Binmore, "Making Decisions in Large Worlds" ["This paper argues that we need to look beyond Bayesian decision theory for an answer to the general problem of making rational decisions under uncertainty." PDF manuscript; thanks to Nicolas Della Penna for the pointer]
• Margaret Boden, The Creative Mind: Myths and Mechanisms [How and when to change the kind of representation you're using, a topic shamefully neglected in the literature. Precis]
• Josh Bongard and Hod Lipson, "Automated reverse engineering of nonlinear dynamical systems", Proceedings of the National Academy of Sciences (USA) 104 (2007): 9943--9948 [Thanks to Chris Weed for pointing me to this. Interesting, but basically unaware of the literature on state-space reconstruction in nonlinear dynamics.]
• R. B. Braithwaite, Scientific Explanation
• Arthur W. Burks, "Peirce's Theory of Abduction", Philosophy of Science 13 (1946): 301--306 [JSTOR; ungated copy]
• Venkat Chandrasekaran and Michael I. Jordan, "Computational and Statistical Tradeoffs via Convex Relaxation", Proceedings of the National Academy of Sciences (USA) 110 (2013): E1181--E1190, arxiv:1211.1073
• Pedro Domingos
  • "The Role of Occam's Razor in Knowledge Discovery," Data Mining and Knowledge Discovery, 3 (1999) [Online]
  • "A Few Useful Things to Know about Machine Learning" [PDF preprint]
• Marco Dorigo and Marco Colombetti, Robot Shaping: An Experiment in Behavior Engineering [Review: Crawling Towards the Light]
• John W. Fisher III, Alexander T. Ihler and Paula A. Viola, "Learning Informative Statistics: A Nonparametric Approach", pp. 900--906 in NIPS 12 (1999) [PDF reprint. I'd call this more of a semi-parametric approach than a fully non-parametric one; they assume a parametric form for the dependence structure, but are agnostic about the distributions of innovations, and so try to maximize non-parametrically estimated mutual informations.]
• Francois Fleuret and Donald Geman, "Stationary Features and Cat Detection", Journal of Machine Learning Research 9 (2008): 2549--2578
• Peter Godfrey-Smith, "Inductions, Samples, and Kinds" [PDF preprint]
• David J. Hand, "Classifier Technology and the Illusion of Progress", Statistical Science 21 (2006): 1--15, math.ST/0606441 [Or: don't believe everything you read in ICML! With commentary, available from the arxiv.org link]
• Hinton and Sejnowski (eds.), Unsupervised Learning [A sort of "Neural Computation's Greatest Hits" compilation]
• Tommi S. Jaakkola and David Haussler, "Exploiting generative models in discriminative classifiers", NIPS 11 (1998) [PDF]
• Aleks Jakulin and Ivan Bratko, "Quantifying and Visualizing Attribute Interactions", cs.AI/0308002
• Kevin T. Kelly [Kelly's work on Occam's Razor is, so far as I know, the only justification for it which doesn't either massively beg the question, change the subject, or make massive assumptions about the nature of the world, Divine Providence, etc.]
• Shane Legg, "Is There an Elegant Universal Theory of Prediction?", cs.AI/0606070 [A nice set of diagonalization arguments against the hope of a universal prediction scheme which has the nice features of Solomonoff-style induction, but is actually computable.]
• Jerzy Neyman, First Course in Probability and Statistics [Fine explanation of his ideas about "rules of inductive behavior" --- which probably isn't very good methodology, but has the makings of excellent robotics]
• Leonid Peshkin, "Structure induction by lossless graph compression", cs.DS/0703132 [Adapting data-compression ideas to discover hierarchical structures in graphs, e.g., the 4 bases from a tinker-toy model of DNA.]
• Ali Rahimi and Benjamin Recht, "Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning", NIPS 2008
• Rajat Raina, Alexis Battle, Honglak Lee, Benjamin Packer and Andrew Y. Ng, "Self-taught learning: Transfer learning from unlabeled data", ICML 2007 [PDF. This is a clever idea for semi-supervised learning. Given a big supply of unlabeled examples, and a small number of labeled examples, use the unlabeled ones to learn a high-level/abstract representation or set of features. Then use those features in straightforward classifier learning on the labeled data. (They have a specific idea for learning the higher-level representation, by basis selection, but that's a separable issue.)]
• Robert E. Schapire and Yoav Freund, Boosting: Foundations and Algorithms [Review: Weak Learners of the World, Unite!]
• Gerhard Schurz, "Universal vs. Local Prediction Strategies: A Game-Theoretical Approach to the Problem of Induction", phil-sci/3720 [Slides only?!?]
• John Shawe-Taylor and Nello Cristianini, Kernel Methods for Pattern Analysis
• Spyros Skouras, "Decisionmetrics: Towards a Decision-Based Approach to Econometrics" [Suppose what you really want to do with your model is to make decisions, e.g., to buy and sell and make money doing so. Then fitting the model to minimize a standard error measure, e.g., mean square error, often gives worse performance than fitting the model to minimize expected losses. This applies much more broadly than Spyros's financial examples may suggest.]
• Aris Spanos, "The Curve-Fitting Problem, Akaike-type Model Selection, and the Error Statistical Approach" [PDF preprint]
• Sara van de Geer, Applications of Empirical Process Theory [A.k.a. Empirical Process Theory in M-Estimation]
• Greg Ver Steeg, Aram Galstyan, "Discovering Structure in High-Dimensional Data Through Correlation Explanation", arxiv:1406.1222
• Vladimir Vovk, Alex Gammerman and Glenn Shafer, Algorithmic Learning in a Random World [Mini-review]
• Blaz Zupan, Marko Bohanec, Janez Demsar and Ivan Bratko, "Learning by discovering concept hierarchies", Artificial Intelligence 109 (1999): 211--242 [Thanks to Aleks Jakulin for letting me know about this. PDF preprint]

Not exactly recommended:
• Dana Ballard, An Introduction to Natural Computation [Review: Not Natural Enough]
• Jacob Feldman, "How surprising is a simple pattern? Quantifying 'Eureka!'," Cognition 93 (2004): 199--224 [Claims to (a) have a psychologically valid measure of subjective complexity, and (b) derive a null distribution for it. But the evidence that his particular complexity measure captures what people do in concept-learning problems is deferred to other papers.]
• Gilbert Harman and Sanjeev Kulkarni, Reliable Reasoning: Induction and Statistical Learning Theory [Published by MIT Press; 2006 draft free online via Prof. Kulkarni (about 100 pages). The technical material on learning theory is mostly alright, so far as it goes, but the philosophy is irritatingly lack-luster. Definitely not worth paying what the publisher charges for it. — There is now a good review by Kevin Kelly and Conor Mayo-Wilson.]

To read:
• Steven Abney, "Bootstrapping", ACL 2002, pp. 360--367 [In the sense of "a problem setting in which one is given a small set of labeled data and a large set of unlabeled data, and the task is to induce a classifier", not the famous statistical procedure]
• Tatsuya Akutsu, Satoru Miyano and Satoru Kuhara, "A simple greedy algorithm for finding functional relations: efficient implementation and average case analysis," Theoretical Computer Science 292 (2002): 481--495
• Atocha Aliseda, Abductive Reasoning: Logical Investigations into Discovery and Explanation
• Andris Ambainis, "Probabilistic inductive inference: a survey", cs.LG/9902026 [Taking "inductive inference" exclusively in the sense of learning recursive functions]
• Rosa I. Arriaga and Santosh Vempala, "An algorithmic theory of learning: Robust concepts and random projection", Machine Learning 63 (2006): 161--182
• Nihat Ay
  • "Locality of global stochastic interaction in directed acyclic networks," preprint, MPI-MIS 54/2001
  • "An information geometric approach to a theory of pragmatic structuring," MPI-MIS 52/2000
• Vijay Balasubramanian, "Statistical Inference, Occam's Razor, and Statistical Mechanics on the Space of Probability Distributions", Neural Computation 9 (1997): 349--368, arxiv:cond-mat/9601030
• Pierre Baldi et al., Modeling the Internet and the Web: Probabilistic Methods and Algorithms
• Jayanta Basak, "Online Adaptive Decision Trees", Neural Computation 16 (2004): 1959--1981
• William Bechtel and Robert C. Richardson, Discovering Complexity: Decomposition and Localization as Strategies in Scientific Research
• Sergey V. Beiden, Marcus A. Maloof and Robert F. Wagner, "A General Model for Finite-Sample Effects in Training and Testing of Competing Classifiers", IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (2003): 1561--1569
• Ron Bekkerman, Mikhail Bilenko and John Langford (eds.), Scaling up Machine Learning: Parallel and Distributed Approaches
• D. Paul Benjamin (ed.), Change of Representation and Inductive Bias
• James Blachowicz, Of Two Minds: The Nature of Inquiry [From the back cover: "The logic of correction developed here directly opposes the claim made by evolutionary epistemologists such as Popper and Campbell that there is no such thing as a 'logical method for having new ideas.' ... This comprehensive and revolutionary theory challenges traditional epistemology's conception of justification and provides substantial new interpretations of the nature of ampliative inference, representation and meaning, Platonic and Hegelian dialectic, Kantian analysis, the heuristic function of models and metaphors, and the role of inquiry in the constitution of human consciousness." All this in only four hundred pages! But the stuff on a logic of correction is very important --- if correct.]
• Gilles Blanchard and Donald Geman, "Hierarchical testing designs for pattern recognition", math.ST/0507421 = Annals of Statistics 33 (2005): 1155--1202
• Hendrik Blockeel and Jan Struyf, "Efficient algorithms for decision tree cross-validation," cs.LG/0110036
• Avrim Blum and Tom Mitchell, "Combining Labeled and Unlabeled Data with Co-Training", COLT 98, pp. 92--100
• Avrim Blum, Adam Kalai and Hal Wasserman, "Noise-Tolerant Learning, the Parity Problem, and the Statistical Query Model," cs.LG/0010022
• Leo Breiman, "Prediction Games and Arcing Algorithms," Neural Computation 11 (1999): 1493--1517
• Robert Alan Brown, Machines that Learn: Based on the Principle of Empirical Control
• Christopher J. C. Burges, "Dimension Reduction: A Guided Tour", Foundations and Trends in Machine Learning 2:4 (2010) [Preprint version]
• Meir Buzaglo, The Logic of Concept Expansion
• Adam Cannon, J. Mark Ettinger, Don Hush, and Clint Scovel, "Machine Learning with Data Dependent Hypothesis Classes," Journal of Machine Learning Research 2 (2002): 335--358
• Philip Ellery Catton, "The Justification(s) of Induction(s)," online
• Tommy W. S. Chow and D. Huang, "Estimating Optimal Feature Subsets Using Efficient Estimation of High-Dimensional Mutual Information", IEEE Transactions on Neural Networks 16 (2005): 213--224
• Andy Clark and Chris Thornton, "Trading Spaces: Computation, Representation and the Limits of Uninformed Learning," Behavioral and Brain Sciences 20 (1997): 57--90 [Draft]
• Bertrand Clarke, "Desiderata for a Predictive Theory of Statistics", Bayesian Analysis 5 (2010): 1--36
• David Corfield, "Varieties of Justification in Machine Learning", Minds and Machines 20 (2010): 291--301
• Toby S. Cubitt, Jens Eisert, Michael M. Wolf, "Extracting dynamical equations from experimental data is NP-hard", Physical Review Letters 108 (2012): 120503, arxiv:1005.0005
• Mark Culp, George Michailidis and Kjell Johnson, "On multi-view learning with additive models", Annals of Applied Statistics 3 (2009): 292--318 = arxiv:0906.1117
• Marco Cuturi and Kenji Fukumizu, "Multiresolution Kernels", cs.LG/0507033
• H. Daume III, D. Marcu, "Domain Adaptation for Statistical Classifiers", arxiv:1109.6341
• Peter Dayan, "Recurrent Sampling Models for the Helmholtz Machine," Neural Computation 11 (1999): 653--677
• Carlos R. de la Mora B., Carlos Gershenson and Angelica Garcia-Vega, "The role of behavior modifiers in representation development", cs.AI/0403006
• Luc Devroye et al., A Probabilistic Theory of Pattern Recognition
• Thomas G. Dietterich, "Machine Learning for Sequential Data" [PDF. Thanks to Gustavo Lacerda for a pointer.]
• Nicola Di Mauro, Teresa M.A. Basile, Stefano Ferilli, Floriana Esposito, "Feature Construction for Relational Sequence Learning", arxiv:1006.5188
• Pedro Domingos [All from his web-site]
  • A General Method for Scaling Up Machine Learning Algorithms and its Application to Clustering
  • Mining High-Speed Data Streams
  • Mining Time-Changing Data Streams
• Dowe, Korb and Oliver (eds.), Information, Statistics and Induction in Science
• Deniz Erdogmus, Kenneth E. Hild, II, Yadunandana N. Rao and José C. Príncipe, "Minimax Mutual Information Approach for Independent Component Analysis", Neural Computation 16 (2004): 1235--1252
• Oleg V. Favorov and Dan Ryder, "SINBAD: A neocortical mechanism for discovering environmental variables and regularities hidden in sensory input", Biological Cybernetics 90 (2004): 191--202
• Aidan Feeney and Evan Heit (eds.), Inductive Reasoning: Experimental, Developmental, and Computational Approaches
• David Finton, "When Do Differences Matter? On-Line Feature Extraction Through Cognitive Economy", cs.LG/0404032 = Cognitive Systems Research 6 (2005): 263--281
• Gary William Flake, "The Calculus of Jacobian Adaptation"
• Francois Fleuret and Eric Brunet, "DEA: An Architecture for Goal Planning and Classification," Neural Computation 12 (2000): 1987--2008
• Flocchini et al. (eds.), Structure, Information and Communication Complexity
• Malcolm R. Forster, "How do Simple Rules 'Fit to Reality' in a Complex World?", Minds and Machines 9 (1999): 543--564 [A take on the Gigerenzer et al. idea of fast and frugal heuristics, especially their ecological adaptation to the environment. "The main purpose of this article is to apply these ideas to learning rules --- methods for constructing, selecting or evaluating competing hypotheses in science, and to the methodology of machine learning... The bad news is that ecological validity is particularly difficult to implement and difficult to understand. The good news is that it builds an important bridge from normative psychology and machine learning to recent work in the philosophy of science, which considers predictive accuracy to be a primary goal of science."]
• Paul Franceschi, "A Solution to Goodman's Paradox," Dialogue 40 (2001) [online]
• Vinod Goel and Raymond J. Dolan, "Differential involvement of left prefrontal cortex in inductive and deductive reasoning", Cognition 93 (2004): B109--B121
• Ulf Grenander, Abstract Inference
• Ulf Grenander and Michael Miller, Pattern Theory: From Representation to Inference
• Laszlo Gyorfi et al., A Distribution-Free Theory of Nonparametric Regression
• Stephen José Hanson et al., eds., Computational Learning Theory and Natural Learning Systems
  • I: Constraints and Prospects
  • II: Interactions between Theory and Experiment
• Petr Hajek and Martin Holena, "Formal logics of discovery and hypothesis formation by machine," Theoretical Computer Science 292 (2002): 345--357
• Peter Hall and Qiwei Yao, "Approximating conditional distribution functions using dimension reduction", math.ST/0507432 = Annals of Statistics 33 (2005): 1404--1421
• Gilbert H. Harman, "The Inference to the Best Explanation", The Philosophical Review 74 (1965): 88--95 [JSTOR; thanks to Kenny Easwaran for the pointer]
• Patrick Heas and Mihai Datcu, "Supervised learning on graphs of spatio-temporal similarity in satellite image sequences", arxiv:0709.3013
• David F. Hendry and Jurgen A. Doornik, Empirical Model Discovery and Theory Evaluation: Automatic Selection Methods in Econometrics
• Jaakko Hintikka
• Ykä Huhtala, Juha Kärkkäinen, Pasi Porkka and Hannu Toivonen, "TANE: An Efficient Algorithm for Discovering Functional and Approximate Dependencies," The Computer Journal 42 (1999): 100--111
• Christian Igel and Marc Toussaint, "On Classes of Functions for which No Free Lunch Results Hold," cs.NE/0108011
• Lancelot F. James, David J. Marchette and Carey Priebe, "Consistent estimation of mixture complexity", Annals of Statistics 29 (2001): 1281--1296
• John R. Josephson and Susan G. Josephson (eds.), Abductive Inference: Computation, Philosophy, Technology
• Yuri Kalnishkan, Vladimir Vovk and Michael V. Vyugin, "How many strings are easy to predict?", Information and Computation 201 (2005): 55--71 ["It is well known in the theory of Kolmogorov complexity that most strings cannot be compressed; more precisely, only exponentially few (O(2^{n-m})) binary strings of length n can be compressed by m bits. This paper extends the 'incompressibility' property of Kolmogorov complexity to the 'unpredictability' property of predictive complexity. The 'unpredictability' property states that predictive complexity (defined as the loss suffered by a universal prediction algorithm working infinitely long) of most strings is close to a trivial upper bound (the loss suffered by a trivial minimax constant prediction strategy). We show that only exponentially few strings can be successfully predicted and find the base of the exponent."]
• Michael Kearns and Dana Ron, "Algorithmic Stability and Sanity-Check Bounds for Leave-One-Out Cross-Validation," Neural Computation 11 (1999): 1427--1453
• Kevin T. Kelly
  • The Logic of Reliable Inquiry [Includes cartoons by the author]
• Eric D. Kolaczyk and Robert D. Nowak, "Multiscale likelihood analysis and complexity penalized estimation", math.ST/0406424 = Annals of Statistics 32 (2004): 500--527
• Ingo Kreuz and Dieter Roller, "Relevant Knowledge First: Reinforcement Learning and Forgetting in Knowledge Based Configuration," cs.AI/0109034
• Henry E. Kyburg Jr. and Choh Man Teng, "Evaluating Defaults," cs.AI/0207083
• Steffen Lange and Gunter Grieser, "Variants of iterative learning," Theoretical Computer Science 292 (2002): 359--376
• Nicolas Le Roux and Yoshua Bengio, "Deep Belief Networks Are Compact Universal Approximators", Neural Computation 22 (2010): 2192--2207
• F. Liang and A. Barron, "Exact Minimax Strategies for Predictive Density Estimation, Data Compression, and Model Selection", IEEE Transactions on Information Theory 50 (2004): 2708--2726
• Stephen Luttrell, "Using Self-Organising Mappings to Learn the Structure of Data Manifolds", cs.NE/0406017
• David J. C. MacKay, Information Theory, Inference and Learning Algorithms [Online version]
• Adrian Mackenzie, Machine Learners: Archaeology of a Data Practice
• Sridhar Mahadevan, Representation Discovery Using Harmonic Analysis
• Gideon S. Mann and Andrew McCallum, "Generalized expectation criteria for semi-supervised learning with weakly labeled data", Journal of Machine Learning Research 11 (2010): 955--984
• Heikki Mannila and Kari-Jouko Räihä, "On the complexity of inferring functional dependencies," Discrete Applied Mathematics 40 (1992): 237--243
• Martin and Osherson, Elements of Scientific Inquiry [A good introduction to the theory of formal learning, especially of recursive functions in the absence of noise. Not even hand-waving that this is a sensible idealization of what scientists do.]
• Conor Mayo-Wilson, Combining Causal Theories and Dividing Scientific Labor [Ph.D. thesis, CMU Philosophy Dept., 2012; thanks to Dr. Mayo-Wilson for a copy]
• Geoffrey J. McLachlan, Discriminant Analysis and Statistical Pattern Recognition
• Abraham Meidan and Boris Levin, "Choosing from Competing Theories in Computerised Learning", Minds and Machines 12 (2002): 119--129
• I. J. Myung, Vijay Balasubramanian and M. A. Pitt, "Counting probability distributions: Differential geometry and model selection", Proceedings of the National Academy of Sciences (USA) 97 (2000): 11170--11175
• National Research Council, Massive Data Sets [Online]
• O. Nelles, Nonlinear System Identification
• Ilya Nemenman, "Fluctuation-Dissipation Theorem and Models of Learning", Neural Computation 17 (2005): 2006--2033 ["We analyze how various abstract Bayesian learners perform on different data and argue that it is difficult to determine which learning-theoretic computation is performed by a particular organism using just its performance in learning a stationary target (learning curve). Based on the fluctuation-dissipation relation in statistical physics, we then discuss a different experimental setup that might be able to solve the problem."]
• Kamal Nigam and Rayid Ghani, "Analyzing the Effectiveness and Applicability of Co-training", CIKM 2000, pp. 86--93
• Sebastian Nowozin, "Improved Information Gain Estimates for Decision Tree Induction", arxiv:1206.4620
• Liam Paninski, "Asymptotic Theory of Information-Theoretic Experimental Design", Neural Computation 17 (2005): 1480--1507
• Hanchuan Peng, Fuhui Long and Chris Ding, "Feature Selection Based on Mutual Information: Criteria of Max-Dependency, Max-Relevance, and Min-Redundancy", IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (2005): 1226--1238 [This sounds like an idea I had in 2002, and was too dumb/lazy to follow up on.]
• Leonid Peshkin, Kee-Eung Kim, Nicolas Meuleau and Leslie Pack Kaelbling, "Learning to Cooperate via Policy Search," cs.LG/0105032
• Leonid Peshkin and Christian R. Shelton, "Learning from Scarce Experience," cs.AI/0204043
• Karl Pfleger
  • On-Line Learning of Undirected Sparse n-grams
  • Learning Predictive Compositional Hierarchies [PS.gz]
• Fenna H. Poletiek, Hypothesis Testing Behaviour [Review by Denny Borsboom]
• Joel B. Predd, Sanjeev R. Kulkarni and H. Vincent Poor
  • "Consistency in Models for Distributed Learning under Communication Constraints", cs.IT/0503071
  • "Distributed Learning in Wireless Sensor Networks", cs.IT/0503072
• Detlef Prescher, "A Tutorial on the Expectation-Maximization Algorithm Including Maximum-Likelihood Estimation and EM Training of Probabilistic Context-Free Grammars", cs.CL/0412015
• Vasin Punyakanok and Dan Roth, "The Use of Classifiers in Sequential Inference," cs.LG/0111003
• Joaquin Quinonero-Candela, Masashi Sugiyama, Anton Schwaighofer and Neil D. Lawrence (eds.), Dataset Shift in Machine Learning
• Maxim Raginsky, "A complexity-regularized quantization approach to nonlinear dimensionality reduction", cs.IT/0501091
• Magnus Rattray, "Stochastic trapping in a solvable model of on-line independent component analysis," cond-mat/0105057
• Salah Rifai, Yoshua Bengio, Yann Dauphin, Pascal Vincent, "A Generative Process for Sampling Contractive Auto-Encoders", ICML 2012, arxiv:1206.6434
• Lorenzo Rosasco, Mikhail Belkin, Ernesto De Vito, "On Learning with Integral Operators", Journal of Machine Learning Research 11 (2010): 905--934
• Dan Roth, "Learning in Natural Language: Theory and Algorithmic Approaches" [online]
• Hichem Sahbi and Donald Geman, "A Hierarchy of Support Vector Machines for Pattern Detection", Journal of Machine Learning Research 7 (2006): 2087--2123
• Erik Sandewall, Features and Fluents: The Representation of Knowledge about Dynamical systems
• Gerhard Schurz
  • "Meta-Induction and the Prediction Game: A New View On Hume's Problem" [PDF preprint]
  • "Patterns of Abduction" [PDF preprint]
• Alcino J. Silva, Anthony Landreth, and John Bickle, Engineering the Next Revolution in Neuroscience: The New Science of Experiment Planning
• Aris Spanos
  • "Statistical Induction, Severe Testing, and Model Validation" [Preprint]
  • "Revisiting data mining: 'hunting' with or without a license", Journal of Economic Methodology 7 (2000): 231--264 [PDF reprint]
• Peter Sollich and Anason Halees, "Learning curves for Gaussian process regression: Approximations and bounds," cond-mat/0105015
• Ray Solomonoff's Papers
• Sonnenberg et al., "The SHOGUN Machine Learning Toolbox", Journal of Machine Learning Research 11 (2010): 1799--1802
• Eduardo D Sontag, "Adaptation Implies Internal Model," math.OC/0203228
• Daria Sorokina, Rich Caruana and Mirek Riedewald, "Additive Groves of Regression Trees", ECML 2007 [PDF]
• Daria Sorokina, Rich Caruana, Mirek Riedewald and Daniel Fink, "Detecting Statistical Interactions with Additive Groves of Trees", ICML 2008 [PDF]
• Susanne Still, "Information theoretic approach to interactive learning", arxiv:0709.1948
• Ron Sun and C. L. Giles (eds.), Sequence Learning: Paradigms, Algorithms, and Applications
• Suvrit Sra, Sebastian Nowozin and Stephen J. Wright (eds.), Optimization for Machine Learning
• Eiji Takimoto and Akira Maruoka, "Top-down decision tree learning as information based boosting," Theoretical Computer Science 292 (2002): 447--464
• Sebastian Thrun and Lorien Pratt (eds.), Learning to Learn
• Robert Tibshirani and Larry Wasserman, "Correlation-sharing for detection of differential gene expression", math.ST/0608061 ["Our proposal averages the univariate scores of each feature with the scores in correlation neighborhoods. ... The general idea of correlation-sharing can be applied to other prediction problems involving a large number of correlated features."]
• Nicholas B. Turk-Browne, Brian J. Scholl, Marvin M. Chun, and Marcia K. Johnson, "Neural Evidence of Statistical Learning: Efficient Detection of Visual Regularities Without Awareness", Journal of Cognitive Neuroscience 21 (2009): 1934--1945
• Richard Turner, Maneesh Sahani, "A Maximum-Likelihood Interpretation for Slow Feature Analysis", Neural Computation 19 (2007): 1022--1038
• Peter D. Turney, "How to shift bias: Lessons from the Baldwin effect," Evolutionary Computation 4 (1996): 271--295 [online]
• Satoshi Watanabe, Knowing and Guessing: A Quantitative Study of Inference and Information
• Ying Yang, Xindong Wu and Xingquan Zhu, "Mining in Anticipation for Concept Change: Proactive-Reactive Prediction in Data Streams", Data Mining and Knowledge Discovery 13 (2006): 261--289
• H. Zha, X. He, C. Ding, M. Gu and H. Simon, "Bipartite Graph Partitioning and Data Clustering," cs.IR/0108018

To write:
• CRS, Causal Architecture and Model Discovery: Theory, Algorithms and Examples
• CRS, "Three Kinds of Complexity in Prediction: Induction, Estimation and Calculation"