## Learning Theory (Formal, Computational or Statistical)

*03 Jun 2016 15:41*

I qualify it to distinguish this area from the broader field of machine learning, which includes much more with lower standards of proof, and from the theory of learning in organisms, which might be quite different.

The basic set-up is as follows. We have a bunch of inputs and outputs, and
an unknown relationship between the two. We do have a class of hypotheses
describing this relationship, and suppose one of them is correct. (The
hypothesis class is always circumscribed, but may be infinite.) A learning
algorithm takes in a set of inputs and outputs, its data, and produces a
hypothesis. Generally we assume the data are generated by some random
process, and the hypothesis changes as the data change. The key notion is that
of a *probably approximately correct* learning algorithm --- one where,
if we supply enough data, we can get a hypothesis with an arbitrarily small
error, with a probability arbitrarily close to one.

Generally, PAC-results concern (1) the existence of a PAC algorithm, (2) quantifying how much data we need, in terms of either accuracy or reliability, or (3) devising new PAC algorithms with other desirable properties. What frustrates me about this literature, and the reason I don't devote more of my research to it (aside, of course, from my sheer incompetence) is that almost all of it assumes the data are statistically independent and identically distributed. Then PAC-like results follow essentially from extensions of the ordinary Law of Large Numbers. What's really needed, however, is something more like an ergodic theorem, for suitably-dependent data. That, however, gets its own notebook.

An interesting question (which I learned of from Vidyasagar's book) has to
do with the difference between distribution-free and distribution-dependent
bounds. Generally, the latter are sharper, sometimes much sharper, but this
comes at the price of making more or less strong parametric assumptions about
the distribution. (One might indeed think of the theory of parametric
statistical inference as learning theory with *very* strong
distributional assumptions.) However, even in the distribution-free set up,
we *have* a whole bunch of samples from the distribution, and
non-parametric density estimation is certainly possible --- could one, e.g.,
improve the bounds by using half the sample to estimate the distribution, and
then applying a distribution-dependent bound? Or will the uncertainty in the
distributional estimate necessarily kill any advantage we might get from
learning about the distribution? It feels like the latter would say something
pretty deep (and depressing) about the whole project of observational
science...

To learn more about: stability-based arguments.

See also: Concentration of Measure; Decision Theory; Empirical Process Theory; Ensemble Methods in Machine Learning; Frequentist Consistency of Bayesian Procedures; Low-Regret Learning; Statistics; Statistics of Structured Data; Universal Prediction Algorithms

- Recommended, bigger picture:
- Martin Anthony and Peter C. Bartlett, Neural Network Learning: Theoretical
Foundations [The theory is
*much*broader than just neural networks] - Olivier Bousquet, Stéphane Boucheron and Gábor Lugosi, "Introduction to Statistical Learning Theory" [PDF. 39 pp. review on how to bound the error of your learning algorithms.]
- Nicolo Cesa-Bianchi and Gabor Lugosi, Prediction, Learning, and Games [Mini-review]
- Nello Cristianini and John Shawe-Taylor, An Introduction to Support Vector Machines [While SVMs are one particular technology among others, this book does an excellent job of crisply introducing the general theory of learning, and showing its practicality.]
- Michael J. Kearns and Umesh V. Vazirani, An Introduction to Computational Learning Theory [Review: How to Build a Better Guesser]
- John Lafferty and Larry Wasserman, Statistical Machine Learning [Unpublished lecture notes]
- Mehryar Mohri, Afshin Rostamizadeh and Ameet Talwalkar, Foundations of Machine Learning [Review]
- Maxim Raginsky, Statistical Learning Theory [Class webpage, with excellent notes and further readings]
- V. N. Vapnik, The Nature of Statistical Learning Theory [Review: A Useful Biased Estimator]
- Mathukumalli Vidyasagar, A Theory of Learning and Generalization: With Applications to Neural Networks and Control Systems [Mini-review.]

- Recommended, close-ups (very misc.):
- Terrence M. Adams and Andrew B. Nobel, "Uniform Approximation of Vapnik-Chervonenkis Classes", arxiv:1010.4515
- Alekh Agarwal, John C. Duchi, Peter L. Bartlett, Clement Levrard, "Oracle inequalities for computationally budgeted model selection" [COLT 2011]
- David Balduzzi
- "Information, learning and falsification", arxiv:1110.3592
- "Falsification and future performance", arxiv:1111.5648

- Moulinath Banerjee, "Covering Numbers and VC dimension" [review notes, UM statistics department, 2004; PS]
- Peter L. Bartlett and Shahar Mendelson, "Rademacher and Gaussian
Complexities: Risk Bounds and Structural
Results", Journal
of Machine Learning Research
**3**(2002): 463--482 - Jonathan Baxter, "A Model of Inductive Bias Learning,"
Journal of Artificial Intelligence Research
**12**(2000): 149--198 [How to learn what class of hypotheses you should be trying to use, i.e., your inductive bias. Assumes independence, again.] - Olivier Bousquet and André Elisseeff, "Stability and
Generalization", Journal
of Machine Learning Research
**2**(2002): 499--526 [A very cool paper, resting on the idea that if your learning algorithm is stable under small perturbations of the data, and you have a lot of data, then how well it predicts the training data has to be a good estimate of how well it will predict future data. Clearly related to Pedro Domingos's "process-oriented evaluation", which they don't mention, but more formal.] - Olivier Catoni, Statistical Learning Theory and Stochastic Optimization [Mini-review]
- N. Cesa-Bianchi, A. Conconi and C. Gentile, "On the Generalization
Ability of On-Line Learning
Algorithms", IEEE
Transactions on Information Theory
**50**(2004): 2050--2057 - Corinna Cortes, Spencer Greenberg, Mehryar Mohri, "Relative Deviation Learning Bounds and Generalization with Unbounded Loss Functions", arxiv:1310.5796
- Corinna Cortes, Yishay Mansour and Mehryar Mohri, "Learning Bounds for Importance Weights", NIPS 23 (2010) [From my point of view, the most interesting part of this is actually the very elegant generalization bound for unbounded loss functions, developed in the supplemental materials, and now followed up by Cortes, Greenberg and Mohri. Comments.]
- Lee-Ad Gottlieb, Aryeh Kontorovich, Robert Krauthgamer, "Efficient Regression in Metric Spaces via Approximate Lipschitz Extension", arxiv:1111.4470
- Ralf Herbrich, Robert C. Williamson, "Algorithmic Luckiness",
Journal of Machine Learning Research
**3**(2002): 175--212 - Michael Jordan, Theoretical Statistics 210B [Link is to the Spring 2008 version of the course, which is the latest I can find right now.]
- Michael J. Kearns and Dana Ron, "Algorithmic stability and sanity-check bounds for leave-one-out cross-validation", Neural Computation
**11**(1999): 1427--1453 [PDF preprint via Prof. Kearns] - Aryeh Kontorovich, "Concentration in unbounded metric spaces and algorithmic stability", arxiv:1309.1007
- Jing Lei, James Robins, Larry Wasserman, "Efficient Nonparametric Conformal Prediction Regions", arxiv:1111.1418
- Ben London, Bert Huang, Lise Getoor, "Graph-based Generalization Bounds for Learning Binary Relations", arxiv:1302.5348
- Gabor Lugosi, Shahar Mendelson and Vladimir Koltchinskii, "A note
on the richness of convex hulls of VC classes", Electronic Communications
in Probability
**8**(2003): 18 ["We prove the existence of a class $ A $ of subsets of $ R^d $ of VC dimension 1 such that the symmetric convex hull $ F $ of the class of characteristic functions of sets in $ A $ is rich in the following sense. For any absolutely continuous probability measure $ m $ on $ R^d $, measurable set $ B $ and $ h >0 $, there exists a function $ f $ in $ F $ such that the measure of the symmetric difference of $ B $ and the set where $ f $ is positive is less than $ h $." The astonishingly simple proof turns on the Borel isomorphism theorem, which says that says the Borel sets of any complete, separable metric space are in one-to-one correspondence with the Borel sets of the unit interval.] - Shahar Mendelson, "Learning without Concentration", COLT 2014: 25--39, arxiv:1401.0304
- David Pollard
- "Asymptotics via Empirical Processes",
Statistical Science
**4**(1989): 341--354 - Empirical Processes: Theory and Applications [Review]

- "Asymptotics via Empirical Processes",
Statistical Science
- Maxim Raginsky, "Divergence-based characterization of fundamental limitations of adaptive dynamical systems", arxiv:1010.2286
- Ali Rahimi and Benjamin Recht, "Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning", NIPS 2008
- Alexander Rakhlin, Karthik Sridharan, Ambuj Tewari, "Online Learning: Random Averages, Combinatorial Parameters, and Learnability", arxiv:1006.1138 [This is not an easy paper to read, but the results are quite deep, and it repays the effort.]
- Philippe Rigollet and Xin Tong, "Neyman-Pearson classification,
convexity and stochastic constraints", Journal of Machine Learning
Research
**12**(2011): 2831--2855, arxiv:1102.5750 - Robert E. Schapire and Yoav Freund, Boosting: Foundations and Algorithms [Review: Weak Learners of the World, Unite!]
- C. Scott and R. Nowak, "A Neyman-Pearson Approach to Statistical
Learning", IEEE
Transactions on Information Theory
**51**(2005): 3806--3819 [Comments] - Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro and Karthik
Sridharan, "Learnability, Stability and Uniform
Convergence", Journal
of Machine Learning Research
**11**(2010): 2635--2670 [If one looks at a broader domain than the usual supervised regression or classification problems, uniform convergence of risks is not required for learnability, but the existence of a stable learning algorithm is. Of course much turns on definitions here, and I'm not completely certain, after only one reading, that I really buy all of theirs...] - John Shawe-Taylor, Peter L. Bartlett, Robert C. Williamson and Martin Anthony, "Structural Risk Minimization over Data-Dependent Hierarchies" [PDF via Prof. Williamson]
- Nathan Srebro, Karthik Sridharan, Ambuj Tewari, "Smoothness, Low-Noise and Fast Rates", arxiv:1009.3896
- J. Michael Steele, The Scary Sequences Project [Notes and references on individual-sequence prediction]
- Matus Telgarsky and Sanjoy Dasgupta, "Moment-based Uniform Deviation Bounds for $k$-means and Friends", arxiv:1311.1903
- Sara van de Geer, Empirical Processes in M-Estimation
- Ramon van Handel, "The universal Glivenko-Cantelli property",
Probability Theory and Related Fields
**155**(2013): 911--934, arxiv:1009.4434 - Vladimir Vovk, Alex Gammerman and Glenn Shafer, Algorithmic Learning in a Random World [Mini-review]
- Xiaojin Zhu, Timothy T. Rogers and Bryan R. Gibson, "Human Rademacher Complexity", in Advances in Neural Information Processing, vol. 22 (NIPS 2009) [my commentary]

- Modesty forbids me to recommend:
- Daniel J. McDonald, CRS and Mark Schervish, "Estimated VC Dimension for Risk Bounds", arxiv:1111.3404 [More]

- To read:
- Yohji Akama, Kei Irie, "VC dimension of ellipsoids", arxiv:1109.4347
- Pierre Alquier, "PAC-Bayesian Bounds for Randomized Empirical Risk Minimizers", arxiv:0712.1698
- Andris Ambainis, "Probabilistic and Team PFIN-type Learning: General Properties", cs.LG/0504001
- Sylvain Arlot, Peter L. Bartlett, "Margin-adaptive model selection in statistical learning", Bernoulli
**17**(2011): 687--713, arxiv:0804.2937 - Jean-Yves Audibert, "Fast learning rates in statistical inference
through aggregation", Annals of Statistics
**37**(2009): 1591--1646, math.ST/0703854 - Jean-Yves Audibert and Olivier Bousquet, "Combining PAC-Bayesian
and Generic Chaining
Bounds", Journal
of Machine Learning Research
**8**(2007): 863--889 - Francis Bach, "Sharp analysis of low-rank kernel matrix approximations", arxiv:1208.2015
- Krishnakumar Balasubramanian, Pinar Donmez, Guy Lebanon, "Unsupervised Supervised Learning II: Training Margin Based Classifiers without Labels", arxiv:1003.0470
- Peter L. Bartlett, Olivier Bousquet, Shahar Mendelson, "Local Rademacher complexities", Annals of Statistics
**33**(2005): 1497--1537, arxiv:math/0508275 - Andrew R. Barron, Albert Cohen, Wolfgang Dahmen, Ronald A. DeVore, "Approximation and learning by greedy algorithms", Annals of Statistics
**36**(2008): 64--94, arxiv:0803.1718 - G. Biau and L. Devroye, "A Note on Density Model Size Testing",
IEEE Transactions on Information Theory
**50**(2004): 576--581 [Testing which member of a nested family of model classes you need to use] - Gilles Blanchard, Olivier Bousquet, Pascal Massart, "Statistical performance of support vector machines", Annals of Statistics
**36**(2008): 489--531, arxiv:0804.0551 - Gilles Blanchard and Francois Fleuret, "Occam's hammer: a link between randomized learning and multiple testing FDR control", math.ST/0608713
- Christian Brownlees, Emilien Joly, Gabor Lugosi, "Empirical risk minimization for heavy-tailed losses", arxiv:1406.2462
- Arnaud Buhot and Mitra B. Gordon, "Storage Capacity of the Tilinglike Learning Algorithm," cond-mat/0008162
- Arnaud Buhot, Mirta B. Gordon and Jean-Pierre Nadal, "Rigorous Bounds to Retarded Learning," cond-mat/0201256
- Stephane Canu, Xavier Mary, Alain Rakotomamonjy, "Functional learning through kernels", arxiv:0910.1013
- Andrea Caponnetto and Alexander Rakhlin, "Stability Properties of
Empirical Risk Minimization over Donsker
Classes", Journal
of Machine Learning Research
**7**(2006): 2565--2583 - Olivier Catoni
- "Improved Vapnik Cervonenkis bounds", math.ST/0410280
- Pac-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning [Full-text open access]

- Kamalika Chaudhuri, Claire Monteleoni, Anand D. Sarwate, "Differentially Private Empirical Risk Minimization", Journal of Machine Learning Research
**12**(2011): 1069--1109 - Olivier Chapelle, Bernhard Schölkopf and Alexander Zien (eds.), Semi-Supervised Learning
- Toby S. Cubitt, Jens Eisert, Michael M. Wolf, "Extracting dynamical
equations from experimental data is NP-hard", Physical Review
Letters
**108**(2012): 120503, arxiv:1005.0005 - Felipe Cucker and Ding Xuan Zhou, Learning Theory: An Approximation Theory Viewpoint
- Arnak Dalalyan and Alexandre B. Tsybakov, "Sparse Regression Learning by Aggregation and Langevin Monte-Carlo", arxiv:0903.1223
- Amit Daniely, Shai Shalev-Shwartz, "Optimal learners for multiclass problems", COLT 2014: 287--316
- C. De Mol, E. De Vito, L. Rosasco, "Elastic-Net Regularization in Learning Theory", arxiv:0807.3423
- Joshua V Dillon, Krishnakumar Balasubramanian, Guy Lebanon, "Asymptotic Analysis of Generative Semi-Supervised Learning", arxiv:1003.0024
- P. P. B. Eggermont, V. N. LaRiccia, "Uniform error bounds for smoothing splines", arxiv:math/0612776
- Vitaly Feldman, "Robustness of Evolvability" [PDF preprint]
- Majid Fozunbal, Ton Kalker, "Decision Making with Side Information and Unbounded Loss Functions", arxiv:cs/0601115
- Yoav Freund, Yishay Mansour and Robert E. Schapire, "Generalization
bounds for averaged classifiers", Annals of
Statistics
**32**(2004): 1698--1722, math.ST/0410092 - Elisabeth Gassiat, Ramon Van Handel, "The local geometry of finite mixtures", arxiv:1202.3482
- Servane Gey, "Vapnik-Chervonenkis Dimension of Axis-Parallel Cuts", arxiv:1203.0193
- Carl Gold and Peter Sollich, "Model Selection for Support Vector Machine Classification," cond-mat/0203334
- Lee-Ad Gottlied, Leonid Kontorovich, Elchanan Mossel, "VC bounds on the cardinality of nearly orthogonal function classes", arxiv:1007.4915
- Adityanand Guntuboyina, Bodhisattva Sen, "Covering Numbers for Convex Functions", IEEE Transactions on Information Theory
**59**(2013): 1957--1965 arxiv:1304.0147 - Steve Hanneke
- "Teaching Dimension and the Complexity of Active Learning" [PDF]
- "A Bound on the Label Complexity of Agnostic Active Learning" [PDF]
- "The Cost Complexity of Active Learning" [PDF]
- "Rates of convergence in active learning",
Annals of Statistics
**39**(2011): 333--361 - "Theory of Disagreement-Based Active Learning",
Foundations and Trends in Machine Learning
**7**(2014): 131--309

- Nancy Heckman, "The theory and application of penalized methods or Reproducing Kernel Hilbert Spaces made easy", arxiv:1111.1915
- Daniel J. L. Herrmann and Dominik Janzing, "Selection Criterion for Log-Linear Models Using Statistical Learning Theory", math.ST/0302079
- Don Hush, clint Scovel and Ingo Steinwart, "Stability of Unstable
Learning Algorithms", Machine Learning
**67**(2007): 197--206 - Sanjay Jain, Daniel Oshershon, James S. Royer and Arun Sharma, Systems That Learn: An Introduction to Learning Theory
- Sham M. Kakade, Shai Shalev-Shwartz, Ambuj Tewari, "Regularization Techniques for Learning with Matrices", Journal of Machine Learning Research
**13**(2012): 1865--1890 - Yuri Kalnishkan, Volodya Vovk and Michael V. Vyugin, "Loss
functions, complexities, and the Legendre transformation", Theoretical Computer
Science
**313**(2004): 195--207 - Gerard Kerkyacharian and Dominique Picard, "Thresholding in Learning Theory", math.ST/0510271
- Jussi Klemela and Enno Mammen, "Empirical risk minimization in inverse problems", Annals of Statistics
**38**(2010): 482--511 - Vladimir Koltchinskii, Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems
- Natalia Komarova and Igor Rivin, "Mathematics of learning," math.PR/0105235
- Samuel Kutin and Partha Niyogi, "Almost-everywhere algorithmic stability and generalization error", UAI 2002, arxiv:1301.0579
- Pirkko Kuusela, Daniel Ocone and Eduardo D. Sontag, "Learning Complexity Dimensions for a Continuous-Time Control System," math.OC/0012163
- John Langford, "Tutorial on Practical Prediction Theory for
Classification", Journal of Machine Learning Research
**6**(2005): 273--306 [For the PAC-Bayesian result] - A. Lecchini-Visintini, J. Lygeros, J. Maciejowski, "Simulated Annealing: Rigorous finite-time guarantees for optimization on continuous domains", 0709.2989
- Guillaume Lecué
- "Simultaneous adaptation to the margin and to complexity in classification", math.ST/0509696
- "Empirical risk minimization is optimal for the convex aggregation problem", Bernoulli
**19**(2013): 2153--2166

- Guillaume Lecué and Shahar Mendelson
- "Aggregation via empirical risk minimization", Probability Theory and
Related Fields
**145**(2009): 591--613 - "Sharper lower bounds on the performance of the empirical risk minimization algorithm", Bernoulli
**16**(2010): 605--613, arxiv:1102.4983 - "General nonexact oracle inequalities for classes with a subexponential envelope", Annals of Statistics
**40**(2012): 832--860 - "Learning subgaussian classes : Upper and minimax bounds", arxiv:1305.4825

- "Aggregation via empirical risk minimization", Probability Theory and
Related Fields
- Feng Liang, Sayan Mukherjee, Mike West, "The Use of Unlabeled Data
in Predictive
Modeling", arxiv:0710.4618
= Statistical Science
**22**(2007): 189--205 - Djamal Louani, Sidi Mohamed Ould Maouloud, "Large Deviation Results for the Nonparametric Regression Function Estimator on Functional Data", arxiv:1111.5989
- Gabor Lugosi and Marten Wegkamp, "Complexity regularization via
localized random penalties", Annals of
Statistics
**32**(2004): 16799--1697 = math.ST/0410091 - Malik Magdon-Ismail, "Permutation Complexity Bound on Out-Sample Error", NIPS 2010 [PDF reprint]
- Dörthe Malzahn and Manfred Opper, "A statistical physics approach for the analysis of machine learning algorithms on real data", Journal of Statistical Mechanics: Theory and Experiment (2005): P11001
- Pascal Massart and Élodie Nédélec, "Risk
bounds for statistical
learning", math.ST/0702683
= Annals of
Statistics
**34**(2006): 2326--2366 - Andreas Maurer, "A Note on the PAC Bayesian Theorem", cs.LG/0411099
- Andreas Maurer, Massimiliano Pontil, "Structured Sparsity and Generalization", Journal of Machine Learning Research
**13**(2012): 671--690 - Shahar Mendelson
- "Improving the Sample Complexity Using Global
Data", IEEE Transactions on Information Theory
**48**(2002): 1977--1991 - "Lower Bounds for the Empirical Minimization Algorithm",
IEEE Transactions on
Information Theory
**54**(2008): 3797--3803 ["simple argument ... that under mild geometric assumptions .... the empirical minimization algorithm cannot yield a uniform error rate that is faster than $1/sqrt{k}$ in the function learning setup".]

- "Improving the Sample Complexity Using Global
Data", IEEE Transactions on Information Theory
- Shahar Mendelson and Gideon Schechtman, "The shattering dimension
of sets of linear functionals", Annals of
Probability
**32**(2004): 1746--1770 = math.PR/0410096 - P. Mitra, C. A. Murthy and S. K. Pal, "A probabilistic active
support vector learning algorithm", IEEE Transactions on Pattern Analysis
and Machine Intelligence
**26**(2004): 413--418 - Sayan Mukherjee, Partha Niyogi, Tomaso Poggio and
Ryan Rifkin, "Learning theory: stability is sufficient for generalization
and necessary and sufficient for consistency of empirical
risk minimization", Advances in Computational Mathematics
**25**(2006): 161--193 [PDF. Thanks to Shiva Kaul for pointing this out to me.] - Boaz Nadler, "Finite sample approximation results for principal component analysis: a matrix perturbation approach", Annals of Statistics
**36**(2008): 2791--2817, arxiv:0901.3245 - Roland Nilsson, Jose M. Pena, Johan Bjorkegren and Jesper Tegner,
"Consistent Feature Selection for Pattern Recognition in Polynomial Time",
Journal of Machine Learning Research
**8**(2007): 589--612 - E. Parrado-Hernandez, I. Mora-Jimenez, J. Arenas-Garca, A. R.
Figueiras-Vidal and A. Navia-Vazquez, "Growing support vector classifiers with
controlled complexity," Pattern
Recognition Letters
**36**(2003): 1479--1488 - Vladimir Pestov
- "PAC learnability of a concept class under non-atomic measures: a problem by Vidyasagar", arxiv:1006.5090
- "PAC learnability versus VC dimension: a footnote to a basic result of statistical learning", arxiv:1104.2097

- Maxim Raginsky, "Achievability results for statistical learning under communication constraints", arxiv:0901.1905
- Alexander Rakhlin and Karthik Sridharan, "Statistical Learning Theory and Sequential Prediction" [PDF lecture notes]
- Liva Ralaivola, Marie Szafranski, Guillaume Stempfel, "Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary \beta-Mixing Processes", arxiv:0909.1933
- Sebastian Risau-Gusman and Mirta B. Gordon
- "Generalization Properties of Finite Size Polynomial
Support Vector Machines," cond-mat/0002071 =?
"Hieararchical Learning in Polynomial Support Vector Machines," Machine
Learning
**46**(2002): 53--70 - "Learning curves for Soft Margin Classifiers," cond-mat/0203315

- "Generalization Properties of Finite Size Polynomial
Support Vector Machines," cond-mat/0002071 =?
"Hieararchical Learning in Polynomial Support Vector Machines," Machine
Learning
- Igor Rivin, "The performance of the batch learner algorithm," cs.LG/0201009
- Dana Ron, "Property Testing: A Learning Theory
Perspective", Foundations and Trends in Machine Learning
**1**(2008): 307--402 [PDF preprint] - Benjamin I. P. Rubinstein, Aleksandr Simma, "On the Stability of
Empirical Risk Minimization in the Presence of Multiple Risk
Minimizers", IEEE
Transactions on Information Theory
**58**(2012): 4160--4163, arxiv:1002.2044 - J. Hyam Rubinstein, Benjamin I. P. Rubinstein, Peter L. Bartlett, "Bounding Embeddings of VC Classes into Maximum Classes", arxiv:1401.7388
- Lorenzo Rosasco, Ernesto De Vito, Andrea Caponnetto, Michele Piana
and Alessandro Verri, "Are Loss Functions All the Same?", Neural
Computation
**16**(2004): 1063--1076 [Ans.: pretty much, apparently, at least if they're reasonably convex] - Cynthia Rudin, "Stability Analysis for Regularized Least Squares Regression", cs.LG/0502016
- Sivan Sabato, Nathan Srebro, Naftali Tishby, "Distribution-Dependent Sample Complexity of Large Margin Learning", Journal of Machine Learning Research
**14**(2013): 2119--2149 - Tom Schaul, Sixin Zhang, Yann LeCun, "No More Pesky Learning Rates", arxiv:1206.1106
- Yevgeny Seldin, Nicolò Cesa-Bianchi, Françe;ois Laviolette, Peter Auer, John Shawe-Taylor, Jan Peters, "PAC-Bayesian Analysis of the Exploration-Exploitation Trade-off", arxiv:1105.4585
- Glenn Shafer and Vladimir Vovk, "A tutorial on conformal prediction", arxiv:0706.3188
- Xuhui Shao, Vladimir Cherkassky and William Li, "Measuring the
VC-Dimension Using Optimized Experimental Design," Neural
Computation
**12**(2000): 1969--1986 - Xiaotong Shen and Lifeng Wang, "Generalization error for multi-class margin classification", Electronic Journal of Statistics
**1**(2007): 307--330, arxiv:0708.3556 - I. Steinwart, "Consistency of Support Vector Machines and Other
Regularized Kernel Classifiers", IEEE Transactions on Information
Theory
**51**(2005): 128--142 - Ingo Steinwart and Andreas Christmann, Support Vector Machines
- Joe Suzuki, "On Strong Consistency of Model Selection in
Classification", IEEE
Transactions on Information Theory
**52**(2006): 4767--4774 [Based on information-theoretic criteria] - Sara A. van de Geer
- "On non-asymptotic bounds for estimation in generalized linear models with highly correlated design", 0709.0844
- "High-dimensional generalized linear models and the
lasso", Annals of Statistics
**36**(2008): 614--645 = arxiv:0804.0703

- Aad van der Vaart and Jon A. Wellner, "A note on bounds for VC dimensions", pp. 103--107 in Christian Houdré, Vladimir Koltchinskii, David M. Mason and Magda Peligrad (eds.), High Dimensional Probability V: The Luminy Volume ["We provide bounds for the VC dimension of class of sets formed by unions, intersections, and products of VC classes of sets". Open access]
- V. Vapnik and O. Chapelle, "Bounds on Error Expectation for Support
Vector Machines," Neural Computation
**12**(2000): 2013--2036 - Silvia Villa, Lorenzo Rosasco, Tomaso Poggio, "On Learnability, Complexity and Stability", arxiv:1303.5976
- Ulrike von Luxburg, Bernhard Schoelkopf, "Statistical Learning Theory: Models, Concepts, and Results", arxiv:0810.4752
- Yu-Xiang Wang, Huan Xu, "Stability of matrix factorization for collaborative filtering", ICML 2012, arxiv:1206.4640
- Huan Xu and Shie Mannor, "Robustness and Generalization",
Machine
Learning
**86**(2012): 391--423 arxiv:1005.2243 - Yuhong Yang and Andrew Barron,
"Information-theoretic determination of minimax rates of convergence",
Annals of Statistics
**27**(1999): 1564--1599 - Kai Yu and Tong Zhang, "High Dimensional Nonlinear Learning using Local Coordinate Coding", arxiv:0906.5190
- Alon Zakai and Ya'acov Ritov, "Consistency and Localizability",
Journal of Machine Learning Reserch
**10**(2009): 827--856 - Chao Zhang and Dachen Tao, "Generalization Bound for Infinitely Divisible Empirical Process", Journal of Machine Learning Research Workshops and Conference Proceedings
**15**(2011): 864--872 - Tong Zhang
- "Covering Number Bounds of Certain Regularized Linear
Function
Classes", Journal
of Machine Learning Research
**2**(2002): 527--550 - "Learning Bounds for Kernel Regression Using Effective
Data
Dimensionality", Neural
Computation
**17**(2005): 2077--2098 - "Information-Theoretic Upper and Lower Bounds for
Statistical
Estimation", IEEE
Transactions on Information Theory
**52**(2006): 1307--1321

- "Covering Number Bounds of Certain Regularized Linear
Function
Classes", Journal
of Machine Learning Research
- Or Zuk, Shiri Margel, Eytan Domany, "On the Number of Samples Needed to Learn the Correct Structure of a Bayesian Network", UAI 2006, arxiv:1206.6862