Learning Theory (Formal, Computational or Statistical)
Last update: 07 Jul 2025 12:28First version: Before 24 April 2003 (2000?)
I qualify it to distinguish this area from the broader field of machine learning, which includes much more with lower standards of proof, and from the theory of learning in organisms, which might be quite different.
The basic set-up is as follows. We have a bunch of inputs and outputs, and an unknown relationship between the two. We do have a class of hypotheses describing this relationship, and suppose one of them is correct. (The hypothesis class is always circumscribed, but may be infinite.) A learning algorithm takes in a set of inputs and outputs, its data, and produces a hypothesis. Generally we assume the data are generated by some random process, and the hypothesis changes as the data change. The key notion is that of a probably approximately correct learning algorithm --- one where, if we supply enough data, we can get a hypothesis with an arbitrarily small error, with a probability arbitrarily close to one.
Generally, PAC-results concern (1) the existence of a PAC algorithm, (2) quantifying how much data we need, in terms of either accuracy or reliability, or (3) devising new PAC algorithms with other desirable properties. What frustrates me about this literature, and the reason I don't devote more of my research to it (aside, of course, from my sheer incompetence) is that almost all of it assumes the data are statistically independent and identically distributed. Then PAC-like results follow essentially from extensions of the ordinary Law of Large Numbers. What's really needed, however, is something more like an ergodic theorem, for suitably-dependent data. That, however, gets its own notebook.
An interesting question (which I learned of from Vidyasagar's book) has to do with the difference between distribution-free and distribution-dependent bounds. Generally, the latter are sharper, sometimes much sharper, but this comes at the price of making more or less strong parametric assumptions about the distribution. (One might indeed think of the theory of parametric statistical inference as learning theory with very strong distributional assumptions.) However, even in the distribution-free set up, we have a whole bunch of samples from the distribution, and non-parametric density estimation is certainly possible --- could one, e.g., improve the bounds by using half the sample to estimate the distribution, and then applying a distribution-dependent bound? Or will the uncertainty in the distributional estimate necessarily kill any advantage we might get from learning about the distribution? It feels like the latter would say something pretty deep (and depressing) about the whole project of observational science...
To learn more about: stability-based arguments.
- See also:
	
- Computational Complexity
- Concentration of Measure
- Conformal Prediction
- Decision Theory
- Density Estimation
- Empirical Process Theory
- Ensemble Methods in Machine Learning
- Expressive Power vs. Learnability
- Frequentist Consistency of Bayesian Procedures
- Interpolation in Statistical Learning
- Learning in Games
- Low-Regret Learning
- Optimization
- Statistics
- Statistics of Structured Data
- Universal Prediction Algorithms
- Recommended, bigger picture:
	
- Martin Anthony and Peter C. Bartlett, Neural Network Learning: Theoretical Foundations [The theory is much broader than just neural networks]
- Olivier Bousquet, Stéphane Boucheron and Gábor Lugosi, "Introduction to Statistical Learning Theory" [PDF. 39 pp. review on how to bound the error of your learning algorithms.]
- Nicolo Cesa-Bianchi and Gabor Lugosi, Prediction, Learning, and Games [Mini-review]
- Nello Cristianini and John Shawe-Taylor, An Introduction to Support Vector Machines [While SVMs are one particular technology among others, this book does an excellent job of crisply introducing the general theory of learning, and showing its practicality.]
- Michael J. Kearns and Umesh V. Vazirani, An Introduction to Computational Learning Theory [Review: How to Build a Better Guesser]
- John Lafferty and Larry Wasserman, Statistical Machine Learning [Unpublished lecture notes]
- Mehryar Mohri, Afshin Rostamizadeh and Ameet Talwalkar, Foundations of Machine Learning [Review]
- Maxim Raginsky, Statistical Learning Theory [Class webpage, with excellent notes and further readings]
- V. N. Vapnik, The Nature of Statistical Learning Theory [Review: A Useful Biased Estimator]
- Mathukumalli Vidyasagar, A Theory of Learning and Generalization: With Applications to Neural Networks and Control Systems [Mini-review.]
- Recommended, close-ups (very misc.):
	
- Alia Abbara, Benjamin Aubin, Florent Krzakala, Lenka Zdeborová, "Rademacher complexity and spin glasses: A link between the replica and statistical theories of learning", arxiv:1912.02729
- Terrence M. Adams and Andrew B. Nobel, "Uniform Approximation of Vapnik-Chervonenkis Classes", arxiv:1010.4515
- Alekh Agarwal, John C. Duchi, Peter L. Bartlett, Clement Levrard, "Oracle inequalities for computationally budgeted model selection" [COLT 2011]
- David Balduzzi
		- "Information, learning and falsification", arxiv:1110.3592
- "Falsification and future performance", arxiv:1111.5648
 
- Moulinath Banerjee, "Covering Numbers and VC dimension" [review notes, UM statistics department, 2004; PS]
- Peter L. Bartlett and Shahar Mendelson, "Rademacher and Gaussian Complexities: Risk Bounds and Structural Results", Journal of Machine Learning Research 3 (2002): 463--482
- Jonathan Baxter, "A Model of Inductive Bias Learning," Journal of Artificial Intelligence Research 12 (2000): 149--198 [How to learn what class of hypotheses you should be trying to use, i.e., your inductive bias. Assumes independence, again.]
- Olivier Catoni, Statistical Learning Theory and Stochastic Optimization
- A. DeSantis, G. Markowsky, M.N. Wegman, "Learning probabilistic prediction functions", in 29th Annual Symposium on Foundations of Computer Science [FOCS 1988]
- Philip D. Laird, "Efficient unsupervised learning", pp. 297--311 in Proceddings of the first annual workshop on Computational Learning Theory [COLT 1988] [Nominally at this URL, but it seems, astonishingly, not to actually be available online anywhere
- Ben London, Bert Huang, Lise Getoor, "Graph-based Generalization Bounds for Learning Binary Relations", arxiv:1302.5348
- Gabor Lugosi, Shahar Mendelson and Vladimir Koltchinskii, "A note on the richness of convex hulls of VC classes", Electronic Communications in Probability 8 (2003): 18 ["We prove the existence of a class $ A $ of subsets of $ R^d $ of VC dimension 1 such that the symmetric convex hull $ F $ of the class of characteristic functions of sets in $ A $ is rich in the following sense. For any absolutely continuous probability measure $ m $ on $ R^d $, measurable set $ B $ and $ h >0 $, there exists a function $ f $ in $ F $ such that the measure of the symmetric difference of $ B $ and the set where $ f $ is positive is less than $ h $." The astonishingly simple proof turns on the Borel isomorphism theorem, which says that says the Borel sets of any complete, separable metric space are in one-to-one correspondence with the Borel sets of the unit interval.]
- Shahar Mendelson, "Learning without Concentration", COLT 2014: 25--39, arxiv:1401.0304
- Mircea Petrache and Shubhendu Trivedi, "Approximation-Generalization Trade-offs under (Approximate) Group Equivariance", arxiv:2305.17592 [This is a very nice set of results about when, and to what extent, models predict better when they show the same symmetries as the data-generating distribution (especially when the symmetry is only approximate).]
- David Pollard
		- "Asymptotics via Empirical Processes", Statistical Science 4 (1989): 341--354
- Empirical Processes: Theory and Applications [Review]
 
- Maxim Raginsky, "Divergence-based characterization of fundamental limitations of adaptive dynamical systems", arxiv:1010.2286
- Ali Rahimi and Benjamin Recht, "Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning", NIPS 2008
- Alexander Rakhlin, Karthik Sridharan, Ambuj Tewari, "Online Learning: Random Averages, Combinatorial Parameters, and Learnability", arxiv:1006.1138 [This is not an easy paper to read, but the results are quite deep, and it repays the effort.]
- Philippe Rigollet and Xin Tong, "Neyman-Pearson classification, convexity and stochastic constraints", Journal of Machine Learning Research 12 (2011): 2831--2855, arxiv:1102.5750
- Robert E. Schapire and Yoav Freund, Boosting: Foundations and Algorithms [Review: Weak Learners of the World, Unite!]
- C. Scott and R. Nowak, "A Neyman-Pearson Approach to Statistical Learning", IEEE Transactions on Information Theory 51 (2005): 3806--3819 [Comments]
- Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro and Karthik Sridharan, "Learnability, Stability and Uniform Convergence", Journal of Machine Learning Research 11 (2010): 2635--2670 [If one looks at a broader domain than the usual supervised regression or classification problems, uniform convergence of risks is not required for learnability, but the existence of a stable learning algorithm is. Of course much turns on definitions here, and I'm not completely certain, after only one reading, that I really buy all of theirs...]
- John Shawe-Taylor, Peter L. Bartlett, Robert C. Williamson and Martin Anthony, "Structural Risk Minimization over Data-Dependent Hierarchies", IEEE Transactions on Information Theory 44 (1998): 1926--1940 [There used to be a PDF on Prof. Williamson's website, but apparently no longer]
- Nathan Srebro, Karthik Sridharan, Ambuj Tewari, "Smoothness, Low-Noise and Fast Rates", arxiv:1009.3896
- J. Michael Steele, The Scary Sequences Project [Notes and references on individual-sequence prediction]
- Matus Telgarsky and Sanjoy Dasgupta, "Moment-based Uniform Deviation Bounds for $k$-means and Friends", arxiv:1311.1903
- Sara van de Geer, Empirical Processes in M-Estimation
- Ramon van Handel, "The universal Glivenko-Cantelli property", Probability Theory and Related Fields 155 (2013): 911--934, arxiv:1009.4434
- Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, Oriol Vinyals, "Understanding deep learning (still) requires rethinking generalization", Communications of the ACM 64 (2021): 107--115 [previous version: arxiv:1611.03530]
- Xiaojin Zhu, Timothy T. Rogers and Bryan R. Gibson, "Human Rademacher Complexity", in Advances in Neural Information Processing, vol. 22 (NIPS 2009) [my commentary]
- Modesty forbids me to recommend:
	
- Daniel J. McDonald, CRS and Mark Schervish, "Estimated VC Dimension for Risk Bounds", arxiv:1111.3404 [More]
- CRS, Lecture notes for 36-465, "Conceptual Foundations of Statistical Learning" [By year; most recently, as I write, 2021]
- To read:
	
- Yohji Akama, Kei Irie, "VC dimension of ellipsoids", arxiv:1109.4347
- Pierre Alquier, "PAC-Bayesian Bounds for Randomized Empirical Risk Minimizers", arxiv:0712.1698
- Andris Ambainis, "Probabilistic and Team PFIN-type Learning: General Properties", cs.LG/0504001
- Sylvain Arlot, Peter L. Bartlett, "Margin-adaptive model selection in statistical learning", Bernoulli 17 (2011): 687--713, arxiv:0804.2937
- Jean-Yves Audibert, "Fast learning rates in statistical inference through aggregation", Annals of Statistics 37 (2009): 1591--1646, math.ST/0703854
- Jean-Yves Audibert and Olivier Bousquet, "Combining PAC-Bayesian and Generic Chaining Bounds", Journal of Machine Learning Research 8 (2007): 863--889
- Francis Bach, "Sharp analysis of low-rank kernel matrix approximations", arxiv:1208.2015
- Krishnakumar Balasubramanian, Pinar Donmez, Guy Lebanon, "Unsupervised Supervised Learning II: Training Margin Based Classifiers without Labels", arxiv:1003.0470
- Peter L. Bartlett, Olivier Bousquet, Shahar Mendelson, "Local Rademacher complexities", Annals of Statistics 33 (2005): 1497--1537, arxiv:math/0508275
- Andrew R. Barron, Albert Cohen, Wolfgang Dahmen, Ronald A. DeVore, "Approximation and learning by greedy algorithms", Annals of Statistics 36 (2008): 64--94, arxiv:0803.1718
- José Bento and Andrea Montanari, "Which graphical models are difficult to learn?", arxiv:0910.5761
- G. Biau and L. Devroye, "A Note on Density Model Size Testing", IEEE Transactions on Information Theory 50 (2004): 576--581 [Testing which member of a nested family of model classes you need to use]
- Gilles Blanchard, Olivier Bousquet, Pascal Massart, "Statistical performance of support vector machines", Annals of Statistics 36 (2008): 489--531, arxiv:0804.0551
- Gilles Blanchard and Francois Fleuret, "Occam's hammer: a link between randomized learning and multiple testing FDR control", math.ST/0608713
- Adam Block, Alexander Rakhlin, Abhishek Shetty, "On the Performance of Empirical Risk Minimization with Smoothed Data", arxiv:2402.14987
- Olivier Bousquet, Yegor Klochkov, Nikita Zhivotovskiy, "Sharper bounds for uniformly stable algorithms", arxiv:1910.07833
- Christian Brownlees, Emilien Joly, Gabor Lugosi, "Empirical risk minimization for heavy-tailed losses", arxiv:1406.2462
- Arnaud Buhot and Mitra B. Gordon, "Storage Capacity of the Tilinglike Learning Algorithm," cond-mat/0008162
- Arnaud Buhot, Mirta B. Gordon and Jean-Pierre Nadal, "Rigorous Bounds to Retarded Learning," cond-mat/0201256
- Stephane Canu, Xavier Mary, Alain Rakotomamonjy, "Functional learning through kernels", arxiv:0910.1013
- Andrea Caponnetto and Alexander Rakhlin, "Stability Properties of Empirical Risk Minimization over Donsker Classes", Journal of Machine Learning Research 7 (2006): 2565--2583
- Olivier Catoni
		- "Improved Vapnik Cervonenkis bounds", math.ST/0410280
- Pac-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning [Full-text open access]
 
- Kamalika Chaudhuri, Claire Monteleoni, Anand D. Sarwate, "Differentially Private Empirical Risk Minimization", Journal of Machine Learning Research 12 (2011): 1069--1109
- Olivier Chapelle, Bernhard Schölkopf and Alexander Zien (eds.), Semi-Supervised Learning
- Yifeng Chu, Maxim Raginsky, "A unified framework for information-theoretic generalization bounds", arxiv:2305.11042
- Thomas M. Cover, "Geometrical and Statistical Properties of Systems of Linear Inequalities with Applications in Pattern Recognition", IEEE Transactions on Electronic Computers EC-14 (1965): 326--334 [Via rvenkat, who points out that this is a remarkable anticipation of VC theory]
- Toby S. Cubitt, Jens Eisert, Michael M. Wolf, "Extracting dynamical equations from experimental data is NP-hard", Physical Review Letters 108 (2012): 120503, arxiv:1005.0005
- Felipe Cucker and Ding Xuan Zhou, Learning Theory: An Approximation Theory Viewpoint
- Arnak Dalalyan and Alexandre B. Tsybakov, "Sparse Regression Learning by Aggregation and Langevin Monte-Carlo", arxiv:0903.1223
- Amit Daniely, Shai Shalev-Shwartz, "Optimal learners for multiclass problems", COLT 2014: 287--316
- Ofir David, Shay Moran, Amir Yehudayoff, "On statistical learning via the lens of compression", NeurIPS 2016 pp. 2792--2800
- C. De Mol, E. De Vito, L. Rosasco, "Elastic-Net Regularization in Learning Theory", arxiv:0807.3423
- Joshua V Dillon, Krishnakumar Balasubramanian, Guy Lebanon, "Asymptotic Analysis of Generative Semi-Supervised Learning", arxiv:1003.0024
- Kefan Dong, Tengyu Ma, "Toward \( L_{\infty} \)-recovery of Nonlinear Functions: A Polynomial Sample Complexity Bound for Gaussian Random Fields", arxiv:2305.00322 [This sounds interesting, but it also seems to say that Gaussian random fields generate especially smooth functions (with high probability), casting doubt on their suitability as a general prior. (Of course I think there are no general priors.)]
- Weinan E, Chao Ma, Lei Wu, "The Generalization Error of the Minimum-norm Solutions for Over-parameterized Neural Networks", Pure and Applied Functional Analysis 5 (2020): 1145--1460, arxiv:1912.06987
- P. P. B. Eggermont, V. N. LaRiccia, "Uniform error bounds for smoothing splines", arxiv:math/0612776
- Vitaly Feldman, "Robustness of Evolvability" [PDF preprint]
- Majid Fozunbal, Ton Kalker, "Decision Making with Side Information and Unbounded Loss Functions", arxiv:cs/0601115
- Yoav Freund, Yishay Mansour and Robert E. Schapire, "Generalization bounds for averaged classifiers", Annals of Statistics 32 (2004): 1698--1722, math.ST/0410092
- Elisabeth Gassiat, Ramon Van Handel, "The local geometry of finite mixtures", arxiv:1202.3482
- Servane Gey, "Vapnik-Chervonenkis Dimension of Axis-Parallel Cuts", arxiv:1203.0193
- Carl Gold and Peter Sollich, "Model Selection for Support Vector Machine Classification," cond-mat/0203334
- Noah Golowich, Ankur Moitra, Dhruv Rohatgi, "Exploration is Harder than Prediction: Cryptographically Separating Reinforcement Learning from Supervised Learning", arxiv:2404.03774
- Lee-Ad Gottlied, Leonid Kontorovich, Elchanan Mossel, "VC bounds on the cardinality of nearly orthogonal function classes", arxiv:1007.4915
- Adityanand Guntuboyina, Bodhisattva Sen, "Covering Numbers for Convex Functions", IEEE Transactions on Information Theory 59 (2013): 1957--1965 arxiv:1304.0147
- Steve Hanneke
		- "Teaching Dimension and the Complexity of Active Learning" [PDF]
- "A Bound on the Label Complexity of Agnostic Active Learning" [PDF]
- "The Cost Complexity of Active Learning" [PDF]
- "Rates of convergence in active learning", Annals of Statistics 39 (2011): 333--361
- "Theory of Disagreement-Based Active Learning", Foundations and Trends in Machine Learning 7 (2014): 131--309
 
- Steve Hanneke, Aryeh Kontorovich, Guy Kornowski, "Efficient Agnostic Learning with Average Smoothness", arxiv:2309.17016
- Nancy Heckman, "The theory and application of penalized methods or Reproducing Kernel Hilbert Spaces made easy", arxiv:1111.1915
- Fredrik Hellstr&/ouml;m, Giuseppe Durisi, Benjamin Guedj, Maxim Raginsky, "Generalization Bounds: Perspectives from Information Theory and PAC-Bayes", arxiv:2309.04381
- Daniel J. L. Herrmann and Dominik Janzing, "Selection Criterion for Log-Linear Models Using Statistical Learning Theory", math.ST/0302079
- Don Hush, clint Scovel and Ingo Steinwart, "Stability of Unstable Learning Algorithms", Machine Learning 67 (2007): 197--206
- Sanjay Jain, Daniel Oshershon, James S. Royer and Arun Sharma, Systems That Learn: An Introduction to Learning Theory
- Sham M. Kakade, Shai Shalev-Shwartz, Ambuj Tewari, "Regularization Techniques for Learning with Matrices", Journal of Machine Learning Research 13 (2012): 1865--1890
- Yuri Kalnishkan, Volodya Vovk and Michael V. Vyugin, "Loss functions, complexities, and the Legendre transformation", Theoretical Computer Science 313 (2004): 195--207
- Gerard Kerkyacharian and Dominique Picard, "Thresholding in Learning Theory", math.ST/0510271
- Jussi Klemela and Enno Mammen, "Empirical risk minimization in inverse problems", Annals of Statistics 38 (2010): 482--511
- Michael Kohler, Adam Krzyzak, "Over-parametrized deep neural networks do not generalize well", arxiv:1912.03925
- Vladimir Koltchinskii, Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems
- Natalia Komarova and Igor Rivin, "Mathematics of learning," math.PR/0105235
- Samuel Kutin and Partha Niyogi, "Almost-everywhere algorithmic stability and generalization error", UAI 2002, arxiv:1301.0579
- Pirkko Kuusela, Daniel Ocone and Eduardo D. Sontag, "Learning Complexity Dimensions for a Continuous-Time Control System," math.OC/0012163
- John Langford, "Tutorial on Practical Prediction Theory for Classification", Journal of Machine Learning Research 6 (2005): 273--306 [For the PAC-Bayesian result]
- Kasper Green Larsen, "Bagging is an Optimal PAC Learner", arxiv:2212.02264
- A. Lecchini-Visintini, J. Lygeros, J. Maciejowski, "Simulated Annealing: Rigorous finite-time guarantees for optimization on continuous domains", 0709.2989
- Guillaume Lecué
		- "Simultaneous adaptation to the margin and to complexity in classification", math.ST/0509696
- "Empirical risk minimization is optimal for the convex aggregation problem", Bernoulli 19 (2013): 2153--2166
 
- Guillaume Lecué and Shahar Mendelson
		- "Aggregation via empirical risk minimization", Probability Theory and Related Fields 145 (2009): 591--613
- "Sharper lower bounds on the performance of the empirical risk minimization algorithm", Bernoulli 16 (2010): 605--613, arxiv:1102.4983
- "General nonexact oracle inequalities for classes with a subexponential envelope", Annals of Statistics 40 (2012): 832--860
- "Learning subgaussian classes : Upper and minimax bounds", arxiv:1305.4825
 
- Feng Liang, Sayan Mukherjee, Mike West, "The Use of Unlabeled Data in Predictive Modeling", arxiv:0710.4618 = Statistical Science 22 (2007): 189--205
- Nick Littlestone and Manfred K. Warmuth, "Relating Data Compression and Learnability", 1986 preprint?
- Djamal Louani, Sidi Mohamed Ould Maouloud, "Large Deviation Results for the Nonparametric Regression Function Estimator on Functional Data", arxiv:1111.5989
- Gabor Lugosi and Marten Wegkamp, "Complexity regularization via localized random penalties", Annals of Statistics 32 (2004): 16799--1697 = math.ST/0410091
- Malik Magdon-Ismail, "Permutation Complexity Bound on Out-Sample Error", NIPS 2010 [PDF reprint]
- Dörthe Malzahn and Manfred Opper, "A statistical physics approach for the analysis of machine learning algorithms on real data", Journal of Statistical Mechanics: Theory and Experiment (2005): P11001
- Pascal Massart and Élodie Nédélec, "Risk bounds for statistical learning", math.ST/0702683 = Annals of Statistics 34 (2006): 2326--2366
- Andreas Maurer, "A Note on the PAC Bayesian Theorem", cs.LG/0411099
- Andreas Maurer, Massimiliano Pontil, "Structured Sparsity and Generalization", Journal of Machine Learning Research 13 (2012): 671--690
- Shahar Mendelson
		- "Improving the Sample Complexity Using Global Data", IEEE Transactions on Information Theory 48 (2002): 1977--1991
- "Lower Bounds for the Empirical Minimization Algorithm", IEEE Transactions on Information Theory 54 (2008): 3797--3803 ["simple argument ... that under mild geometric assumptions .... the empirical minimization algorithm cannot yield a uniform error rate that is faster than $1/sqrt{k}$ in the function learning setup".]
 
- Shahar Mendelson and Gideon Schechtman, "The shattering dimension of sets of linear functionals", Annals of Probability 32 (2004): 1746--1770 = math.PR/0410096
- P. Mitra, C. A. Murthy and S. K. Pal, "A probabilistic active support vector learning algorithm", IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (2004): 413--418
- Sayan Mukherjee, Partha Niyogi, Tomaso Poggio and Ryan Rifkin, "Learning theory: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization", Advances in Computational Mathematics 25 (2006): 161--193 [PDF. Thanks to Shiva Kaul for pointing this out to me.]
- Boaz Nadler, "Finite sample approximation results for principal component analysis: a matrix perturbation approach", Annals of Statistics 36 (2008): 2791--2817, arxiv:0901.3245
- Roland Nilsson, Jose M. Pena, Johan Bjorkegren and Jesper Tegner, "Consistent Feature Selection for Pattern Recognition in Polynomial Time", Journal of Machine Learning Research 8 (2007): 589--612
- E. Parrado-Hernandez, I. Mora-Jimenez, J. Arenas-Garca, A. R. Figueiras-Vidal and A. Navia-Vazquez, "Growing support vector classifiers with controlled complexity," Pattern Recognition Letters 36 (2003): 1479--1488
- Vladimir Pestov
		- "PAC learnability of a concept class under non-atomic measures: a problem by Vidyasagar", arxiv:1006.5090
- "PAC learnability versus VC dimension: a footnote to a basic result of statistical learning", arxiv:1104.2097
 
- Maxim Raginsky, "Achievability results for statistical learning under communication constraints", arxiv:0901.1905
- Alexander Rakhlin and Karthik Sridharan, "Statistical Learning Theory and Sequential Prediction" [PDF lecture notes]
- Liva Ralaivola, Marie Szafranski, Guillaume Stempfel, "Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary \beta-Mixing Processes", arxiv:0909.1933
- Sebastian Risau-Gusman and Mirta B. Gordon
		- "Generalization Properties of Finite Size Polynomial Support Vector Machines," cond-mat/0002071 =? "Hieararchical Learning in Polynomial Support Vector Machines," Machine Learning 46 (2002): 53--70
- "Learning curves for Soft Margin Classifiers," cond-mat/0203315
 
- Igor Rivin, "The performance of the batch learner algorithm," cs.LG/0201009
- Dana Ron, "Property Testing: A Learning Theory Perspective", Foundations and Trends in Machine Learning 1 (2008): 307--402 [PDF preprint]
- Benjamin I. P. Rubinstein, Aleksandr Simma, "On the Stability of Empirical Risk Minimization in the Presence of Multiple Risk Minimizers", IEEE Transactions on Information Theory 58 (2012): 4160--4163, arxiv:1002.2044
- J. Hyam Rubinstein, Benjamin I. P. Rubinstein, Peter L. Bartlett, "Bounding Embeddings of VC Classes into Maximum Classes", arxiv:1401.7388
- Lorenzo Rosasco, Ernesto De Vito, Andrea Caponnetto, Michele Piana and Alessandro Verri, "Are Loss Functions All the Same?", Neural Computation 16 (2004): 1063--1076
- Cynthia Rudin, "Stability Analysis for Regularized Least Squares Regression", cs.LG/0502016
- Sivan Sabato, Nathan Srebro, Naftali Tishby, "Distribution-Dependent Sample Complexity of Large Margin Learning", Journal of Machine Learning Research 14 (2013): 2119--2149
- Narayana Santhanam, Venkatachalam Anantharam, Wojciech Szpankowski, "Data-Derived Weak Universal Consistency", Journal of Machine Learning Research 23 (2022): 27
- Tom Schaul, Sixin Zhang, Yann LeCun, "No More Pesky Learning Rates", arxiv:1206.1106
- Milad Sefidgaran, Abdellatif Zaidi, "Data-dependent Generalization Bounds via Variable-Size Compressibility", arxiv:2303.05369
- Yevgeny Seldin, Nicolò Cesa-Bianchi, Françe;ois Laviolette, Peter Auer, John Shawe-Taylor, Jan Peters, "PAC-Bayesian Analysis of the Exploration-Exploitation Trade-off", arxiv:1105.4585
- Xuhui Shao, Vladimir Cherkassky and William Li, "Measuring the VC-Dimension Using Optimized Experimental Design," Neural Computation 12 (2000): 1969--1986
- Xiaotong Shen and Lifeng Wang, "Generalization error for multi-class margin classification", Electronic Journal of Statistics 1 (2007): 307--330, arxiv:0708.3556
- I. Steinwart, "Consistency of Support Vector Machines and Other Regularized Kernel Classifiers", IEEE Transactions on Information Theory 51 (2005): 128--142
- Ingo Steinwart and Andreas Christmann, Support Vector Machines
- Joe Suzuki, "On Strong Consistency of Model Selection in Classification", IEEE Transactions on Information Theory 52 (2006): 4767--4774 [Based on information-theoretic criteria]
- Matus Telgarsky, "Stochastic linear optimization never overfits with quadratically-bounded losses on general data", arxiv:2202.06915
- Sara A. van de Geer
		- "On non-asymptotic bounds for estimation in generalized linear models with highly correlated design", 0709.0844
- "High-dimensional generalized linear models and the lasso", Annals of Statistics 36 (2008): 614--645 = arxiv:0804.0703
 
- Aad van der Vaart and Jon A. Wellner, "A note on bounds for VC dimensions", pp. 103--107 in Christian Houdré, Vladimir Koltchinskii, David M. Mason and Magda Peligrad (eds.), High Dimensional Probability V: The Luminy Volume ["We provide bounds for the VC dimension of class of sets formed by unions, intersections, and products of VC classes of sets". Open access]
- V. Vapnik and O. Chapelle, "Bounds on Error Expectation for Support Vector Machines," Neural Computation 12 (2000): 2013--2036
- Silvia Villa, Lorenzo Rosasco, Tomaso Poggio, "On Learnability, Complexity and Stability", arxiv:1303.5976
- Ulrike von Luxburg, Bernhard Schoelkopf, "Statistical Learning Theory: Models, Concepts, and Results", arxiv:0810.4752
- Yu-Xiang Wang, Huan Xu, "Stability of matrix factorization for collaborative filtering", ICML 2012, arxiv:1206.4640
- Huan Xu and Shie Mannor, "Robustness and Generalization", Machine Learning 86 (2012): 391--423 arxiv:1005.2243
- Yuhong Yang and Andrew Barron, "Information-theoretic determination of minimax rates of convergence", Annals of Statistics 27 (1999): 1564--1599
- Kai Yu and Tong Zhang, "High Dimensional Nonlinear Learning using Local Coordinate Coding", arxiv:0906.5190
- Alon Zakai and Ya'acov Ritov, "Consistency and Localizability", Journal of Machine Learning Reserch 10 (2009): 827--856
- Chao Zhang and Dachen Tao, "Generalization Bound for Infinitely Divisible Empirical Process", Journal of Machine Learning Research Workshops and Conference Proceedings 15 (2011): 864--872
- Tong Zhang
		- "Covering Number Bounds of Certain Regularized Linear Function Classes", Journal of Machine Learning Research 2 (2002): 527--550
- "Learning Bounds for Kernel Regression Using Effective Data Dimensionality", Neural Computation 17 (2005): 2077--2098
- "Information-Theoretic Upper and Lower Bounds for Statistical Estimation", IEEE Transactions on Information Theory 52 (2006): 1307--1321
 
- Ying Zhu, "Phase transitions in nonparametric regressions: a curse of exploiting higher degree smoothness assumptions in finite samples", arxiv:2112.03626
- Or Zuk, Shiri Margel, Eytan Domany, "On the Number of Samples Needed to Learn the Correct Structure of a Bayesian Network", UAI 2006, arxiv:1206.6862