Learning Theory (Formal, Computational or Statistical)
24 Dec 2008 10:34I qualify it to distinguish this area from the broader field of machine learning, which includes much more with lower standards of proof, and from the theory of learning in organisms, which might be quite different.
The basic set-up is as follows. We have a bunch of inputs and outputs, and an unknown relationship between the two. We do have a class of hypotheses describing this relationship, and suppose one of them is correct. (The hypothesis class is always circumscribed, but may be infinite.) A learning algorithm takes in a set of inputs and outputs, its data, and produces a hypothesis. Generally we assume the data are generated by some random process, and the hypothesis changes as the data change. The key notion is that of a probaably approximately correct learning algorithm --- one where, if we supply enough data, we can get a hypothesis with an arbitrarily small error, with a probability arbitrarily close to one.
Generally, PAC-results concern either (1) the existence of a PAC algorithm, (2) quantifying how much data we need, in terms of either accuracy or reliability, or (3) devising new PAC algorithms with other desirable properties. What frustrates me about this literature, and the reason I don't devote more of my research to it (aside, of course, from my sheer incompetence) is that almost all of it assumes the data are statistically independent and identically distributed. Then PAC-like results follow essentially from extensions of the ordinary Law of Large Numbers. What's really needed, however, is something more like an ergodic theorem, for suitably-dependent data.
See also: Frequentist Consistency of Bayesian Procedures; Statistics; Statistics of Structured Data; Universal Prediction Algorithms
- Recommended, bigger picture:
- Olivier Bousquet, Stéphane Boucheron and Gábor Lugosi, "Introduction to Statistical Learning Theory" [PDF. 39 pp. review on how to bound the error of your learning algorithms.]
- Nicolo Cesa-Bianchi and Gabor Lugosi, Prediction, Learning, and Games [Mini-review]
- Nello Cristianini and John Shawe-Taylor, An Introduction to Support Vector Machines [While SVMs are one particular technology among others, this book does an xcellent job of crisply introducing the general theory of learning, and showing its practicality.]
- Michael J. Kearns and Umesh V. Vazirani, An Introduction to Computational Learning Theory [Review: How to Build a Better Guesser]
- John Lafferty and Larry Wasserman, Statistical Machine Learning [Unpublished lecture notes]
- V. N. Vapnik, The Nature of Statistical Learning Theory [Review: A Useful Biased Estimator]
- Mathukumalli Vidyasagar, A Theory of Learning and Generalization: With Applications to Neural Networks and Control Systems [Vapnik-nik learning theory meets empirical process theory]
- Recommended, close-ups (very misc.):
- Sanjeev Arora, Elad Hazan and Satyen Kale, "The Multiplicative Weights Update Method: a Meta Algorithm and Applications " [PDF preprint. This is an interesting kind of result, which promises performance which comes to close that acheived by any strategy within a fixed class, no matter what sequence of data is observed --- but it's performance on that sequence, which, as the saying goes, "is no guarantee of future results". Cesa-Bianchi and Lugosi's book has a lot more along these lines.]
- Moulinath Banerjee, "Covering Numbers and VC dimension" [review notes, UM statistics department, 2004; PS]
- Jonathan Baxter, "A Model of Inductive Bias Learning," Journal of Artificial Intelligence Research 12 (2000): 149--198 [How to learn what class of hypotheses you should be trying to use, i.e., your inductive bias. Assumes independence, again.]
- Andrew Nobel and Amir Dembo, "A Note on Uniform Laws of Averages for Dependent Processes", Statistics and Probability Letters 17 (1993): 169--172 [An extremely easy way to extend uniform laws of large numbers to uniform ergodic theorems for mixing processes. Actually I suspect that mixing is only necessary to get an explicit rate; I should re-read it. PDF preprint via Dr. Nobel.]
- David Pollard
- "Asymptotics via Empirical Processes", Statistical Science 4 (1989): 341--354
- Empirical Processes: Theory and Applications
- Daniil Ryabko, "Pattern Recognition for Conditionally Indpendent Data", cs.LG/0507040 [Generalizing classical learning theory to allow for dependent data. It's a bit of an odd form of dependence: the sequence of labels can show essentially any form of dependence you like, but the features of the nth object have to be independent of everything else, given the label of the nth object. (Hence "conditionally independent data".) This is like some kinds of hidden Markov model...]
- C. Scott and R. Nowak, "A Neyman-Pearson Approach to Statistical Learning", IEEE Transactions on Information Theory 51 (2005): 3806--3819
- To read:
- Andris Ambainis, "Probabilistic and Team PFIN-type Learning: General Properties", cs.LG/0504001
- Alp Atici and Rocco A. Servedio, "Improved Bounds on Quantum Learning Algorithms", quant-ph/0411140
- Jean-Yves Audibert, "Fast learning rates in statistical inference through aggregation", math.ST/0703854
- G. Biau and L. Devroye, "A Note on Density Model Size Testing", IEEE Transactions on Information Theory 50 (2004): 576--581 [Testing which member of a nested family of model classes you need to use]
- Gilles Blanchard and Francois Fleuret, "Occam's hammer: a link between randomized learning and multiple testing FDR control", math.ST/0608713
- Olivier Bousquet and André Elisseeff, "Algorithmic Stability and Generalization Performance" ["We present a novel way of obtaining PAC-style bounds on the generalization of learning algorithms, explicitly using their stability properties. A stable learner is one for which the learned solution does not change much with small changes in the training set. The bounds we obtain do not depend on any measure of the complexity of the hypothesis spce (e.g. VC dimension) but rather depend on how the learning algorithm searches this space, and can thus be applied even when the VC dimension is infinite." Sounds like a formalization of (something close to) Pedro Domingos's "process-oriented evaluation". PDF]
- Arnaud Buhot and Mitra B. Gordon, "Storage Capacity of the Tilinglike Learning Algorithm," cond-mat/0008162
- Arnaud Buhot, Mirta B. Gordon and Jean-Pierre Nadal, "Rigorous Bounds to Retarded Learning," cond-mat/0201256
- Andrea Caponnetto and Alexander Rakhlin, "Stability Properties of Empirical Risk Minimization over Donsker Classes", Journal of Machine Learning Research 7 (2006): 2565--2583
- Bruno Caprile, Cesare Furlanello and Stefano Merler, "The Dynamics of AdaBoost Weights Tells You What's Hard to Classify," cs.LG/0201014
- Olivier Catoni, "Improved Vapnik Cervonenkis bounds", math.ST/0410280
- N. Cesa-Bianchi, A. Conconi and C. Gentile, "On the Generalization Ability of On-Line Learning Algorithms", IEEE Transactions on Information Theory 50 (2004): 2050--2057
- Nicolo Cesa-Bianchi, Yishay Mansour, and Gilles Stoltz, "Improved Second-Order Bounds for Prediction with Expert Advice", math.ST/0602629 ["This work studies external regret in sequential prediction games with both positive and negative payoffs. External regret measures the difference between the payoff obtained by the forecasting strategy and the payoff of the best action.... new and sharper regret bounds for the well-known exponentially weighted average forecaster and for a new forecaster with a different multiplicative update rule. ... no preliminary knowledge about the payoff sequence is needed, not even its range; ... bounds are expressed in terms of sums of squared payoffs, replacing larger first-order quantities appearing in previous bounds. In addition, our most refined bounds have the natural and desirable property of being stable under rescalings and general translations of the payoff sequence."]
- Olivier Chapelle, Bernhard Schölkopf and Alexander Zien (eds.), Semi-Supervised Learning [Blurb]
- Felipe Cucker and Ding Xuan Zhou, Learning Theory: An Approximation Theory Viewpoint [Blurb]
- Yoav Freund, Yishay Mansour and Robert E. Schapire, "Generalization bounds for averaged classifiers", Annals of Statistics 32 (2004): 1698--1722 = math.ST/0410092
- Alexander Gammerman, Vladimir Vovk, "Hedging predictions in machine learning", cs.LG/0611011
- Carl Gold and Peter Sollich, "Model Selection for Support Vector Machine Classification," cond-mat/0203334
- Steve Hanneke
- Ralf Herbrich, Learning Kernel Classifiers: Theory and Algorithms
- Daniel J. L. Herrmann and Dominik Janzing, "Selection Criterion for Log-Linear Models Using Statistical Learning Theory", math.ST/0302079
- Sanjay Jain, Daniel Oshershon, James S. Royer and Arun Sharma, Systems That Learn: An Introduction to Learning Theory
- Yuri Kalnishkan, Volodya Vovk and Michael V. Vyugin, "Loss functions, complexities, and the Legendre transformation", Theoretical Computer Science 313 (2004): 195--207
- Gerard Kerkyacharian and Dominique Picard, "Thresholding in Learning Theory", math.ST/0510271
- Natalia Komarova and Igor Rivin, "Mathematics of learning," math.PR/0105235
- Pirkko Kuusela, Daniel Ocone and Eduardo D. Sontag, "Learning Complexity Dimensions for a Continuous-Time Control System," math.OC/0012163
- Niels Landwehr, Kristian Kersting and Luc De Raedt, "Integrating Naive Bayes and FOIL", Journal of Machine Learning Research 8 (2007) 481-507
- John Langford, "Tutorial on Practical Prediction Theory for Classification", Journal of Machine Learning Research 6 (2005): 273--306 [For the PAC-Bayesian result]
- A. Lecchini-Visintini, J. Lygeros, J. Maciejowski, "Simulated Annealing: Rigorous finite-time guarantees for optimization on continuous domains", 0709.2989
- Guillaume Lecue, "Simultaneous adaptation to the margin and to complexity in classification", math.ST/0509696
- Feng Liang, Sayan Mukherjee, Mike West, "The Use of Unlabeled Data in Predictive Modeling", arxiv:0710.4618 = Statistical Science 22 (2007): 189--205
- Gabor Lugosi and Marten Wegkamp, "Complexity regularization via localized random penalties", Annals of Statistics 32 (2004): 16799--1697 = math.ST/0410091
- Pascal Massart and Élodie Nédélec, "Risk bounds for statistical learning", math.ST/0702683 = Annals of Statistics 34 (2006): 2326--2366
- Andreas Maurer, "A Note on the PAC Bayesian Theorem", cs.LG/0411099
- Shahar Mendelson
- "Improving the Sample Complexity Using Global Data", IEEE Transactions on Information Theory 48 (2002): 1977--1991
- "Lower Bounds for the Empirical Minimization Algorithm", IEEE Transactions on Information Theory 54 (2008): 3797--3803 ["simple argument ... that under mild geometric assumptions .... the empirical minimization algorithm cannot yield a uniform error rate that is faster than $1/sqrt{k}$ in the function learning setup".]
- Shahar Mendelson and Gideon Schechtman, "The shattering dimension of sets of linear functionals", Annals of Probability 32 (2004): 1746--1770 = math.PR/0410096
- P. Mitra, C. A. Murthy and S. K. Pal, "A probabilistic active support vector learning algorithm", IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (2004): 413--418
- Mehryar Mohri, Afshin Rostamizadeh, "Stability Bound for Stationary Phi-mixing and Beta-mixing Processes", arxiv:0811.1629
- Roland Nilsson, Jose M. Pena, Johan Bjorkegren and Jesper Tegner, "Consistent Feature Selection for Pattern Recognition in Polynomial Time", Journal of Machine Learning Research 8 (2007): 589--612
- E. Parrado-Hernandez, I. Mora-Jimenez, J. Arenas-Garca, A. R. Figueiras-Vidal and A. Navia-Vazquez, "Growing support vector classifiers with controlled complexity," Pattern Recognition Letters 36 (2003): 1479--1488
- Sebastian Risau-Gusman and Mirta B. Gordon
- "Generalization Properties of Finite Size Polynomial Support Vector Machines," cond-mat/0002071 =? "Hieararchical Learning in Polynomial Support Vector Machines," Machine Learning 46 (2002): 53--70
- "Learning curves for Soft Margin Classifiers," cond-mat/0203315
- Igor Rivin, "The performance of the batch learner algorithm," cs.LG/0201009
- Cynthia Rudin, "Stability Analysis for Regularized Least Squares Regression", cs.LG/0502016
- Lorenzo Rosasco, Ernesto De Vito, Andrea Caponnetto, Michele Piana and Alessandro Verri, "Are Loss Functions All the Same?", Neural Computation 16 (2004): 1063--1076 [Ans.: pretty much, apparently, at least if they're reasonably convex]
- Glenn Shafer and Vladimir Vovk, "A tutorial on conformal prediction", arxiv:0706.3188
- Xuhui Shao, Vladimir Cherkassky and William Li, "Measuring the VC-Dimension Using Optimized Experimental Design," Neural Computation 12 (2000): 1969--1986
- I. Steinwart, "Consistency of Support Vector Machines and Other Regularized Kernel Classifiers", IEEE Transactions on Information Theory 51 (2005): 128--142
- Joe Suzuki, "On Strong Consistency of Model Selection in Classification", IEEE Transactions on Information Theory 52 (2006): 4767--4774 [Based on information-theoretic criteria]
- Sara A. van de Geer
- "On non-asymptotic bounds for estimation in generalized linear models with highly correlated design", 0709.0844
- "High-dimensional generalized linear models and the lasso", Annals of STatistics 36 (2008): 614--645 = arxiv:0804.0703
- V. Vapnik and O. Chapelle, "Bounds on Error Expectation for Support Vector Machines," Neural Computation 12 (2000): 2013--2036
- Vladimir Vovk
- "Non-asymptotic calibration and resolution", cs.LG/0506004
- "Defensive prediction with expert advice", cs.LG/0506041
- "Competing with wild prediction rules", cs.LG/0512059
- "Predictions as statements and decisions", cs.LG/0606093
- "Leading strategies in competitive on-line prediction", cs.LG/0607134
- "Competing with Markov prediction strategies", cs.LG/0607136
- "Metric entropy in competitive on-line prediction", cs.LG/0609045
- Vladimir Vovk, Alex Gammerman and Glenn Shafer, Algorithmic Learning ina Random World [Blurb]
- Vladimir Vovk, Akimichi Takemura, Glenn Shafer, "Defensive forecasting", cs.LG/0505083 [Also AIStats '05]
- T. Zhang, "Learning Bounds for Kernel Regression Using Effective Data Dimensionality", Neural Computation 17 (2005): 2077--2098
