Learning Theory (Formal, Computational or Statistical)

Last update: 07 Jul 2025 12:28
First version: Before 24 April 2003 (2000?)

I qualify it to distinguish this area from the broader field of machine learning, which includes much more with lower standards of proof, and from the theory of learning in organisms, which might be quite different.

The basic set-up is as follows. We have a bunch of inputs and outputs, and an unknown relationship between the two. We do have a class of hypotheses describing this relationship, and suppose one of them is correct. (The hypothesis class is always circumscribed, but may be infinite.) A learning algorithm takes in a set of inputs and outputs, its data, and produces a hypothesis. Generally we assume the data are generated by some random process, and the hypothesis changes as the data change. The key notion is that of a probably approximately correct learning algorithm --- one where, if we supply enough data, we can get a hypothesis with an arbitrarily small error, with a probability arbitrarily close to one.
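
In symbols, one common way to state the PAC property looks like the following. The notation is an assumption of this sketch, not something fixed above: $ \hat{h}_n $ is the hypothesis the algorithm produces from $ n $ data points, and $ \mathrm{err}(h) $ is the probability that $ h $ gets a fresh data point wrong. The algorithm is PAC if, for every accuracy $ \epsilon > 0 $ and every confidence level $ \delta \in (0,1) $, there is a sample size $ n(\epsilon, \delta) $ such that, whichever hypothesis in the class is the correct one,

```latex
\[
n \geq n(\epsilon, \delta)
\quad \Longrightarrow \quad
\Pr\left( \mathrm{err}(\hat{h}_n) > \epsilon \right) \leq \delta .
\]
```

Here "approximately correct" is the event $ \mathrm{err}(\hat{h}_n) \leq \epsilon $, and "probably" is the requirement that this event has probability at least $ 1 - \delta $ over the random draw of the data.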

Generally, PAC-results concern (1) the existence of a PAC algorithm, (2) quantifying how much data we need, in terms of either accuracy or reliability, or (3) devising new PAC algorithms with other desirable properties. What frustrates me about this literature, and the reason I don't devote more of my research to it (aside, of course, from my sheer incompetence), is that almost all of it assumes the data are statistically independent and identically distributed. Then PAC-like results follow essentially from extensions of the ordinary Law of Large Numbers. What's really needed, however, is something more like an ergodic theorem, for suitably-dependent data. That, however, gets its own notebook.
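
To make point (2) concrete, here is a minimal sketch for the simplest textbook case. Everything specific in it is an assumption of the example rather than something claimed above: the hypothesis class is finite and contains the correct hypothesis, the data are IID, and the learner returns any hypothesis consistent with the data. A union bound over the "bad" hypotheses (those with error above $ \epsilon $) then shows that $ n \geq (\ln |H| + \ln (1/\delta)) / \epsilon $ data points suffice. The function names are hypothetical.

```python
import math


def pac_sample_size(num_hypotheses: int, epsilon: float, delta: float) -> int:
    """Sample size sufficient for error <= epsilon with probability >= 1 - delta.

    Assumes (for illustration only): finite hypothesis class, realizable case,
    IID data, learner outputs a hypothesis consistent with all the data.
    Derivation: P(some bad hypothesis survives) <= |H| * (1 - eps)^n
                                                <= |H| * exp(-eps * n) <= delta.
    """
    if num_hypotheses < 1 or not (0 < epsilon < 1) or not (0 < delta < 1):
        raise ValueError("need num_hypotheses >= 1 and epsilon, delta in (0, 1)")
    return math.ceil((math.log(num_hypotheses) + math.log(1.0 / delta)) / epsilon)


def pac_error_bound(num_hypotheses: int, n: int, delta: float) -> float:
    """Invert the same bound: the accuracy guaranteed by n data points."""
    if n < 1:
        raise ValueError("need at least one data point")
    return (math.log(num_hypotheses) + math.log(1.0 / delta)) / n


if __name__ == "__main__":
    n = pac_sample_size(num_hypotheses=1000, epsilon=0.1, delta=0.05)
    print(n, pac_error_bound(1000, n, 0.05))
```

Note how weakly the required sample size depends on reliability (logarithmically in $ 1/\delta $) compared with accuracy (linearly in $ 1/\epsilon $); that asymmetry is typical of bounds of this kind, though the exact form here is special to the finite, realizable, IID case.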

An interesting question (which I learned of from Vidyasagar's book) has to do with the difference between distribution-free and distribution-dependent bounds. Generally, the latter are sharper, sometimes much sharper, but this comes at the price of making more or less strong parametric assumptions about the distribution. (One might indeed think of the theory of parametric statistical inference as learning theory with very strong distributional assumptions.) However, even in the distribution-free set-up, we have a whole bunch of samples from the distribution, and non-parametric density estimation is certainly possible --- could one, e.g., improve the bounds by using half the sample to estimate the distribution, and then applying a distribution-dependent bound? Or will the uncertainty in the distributional estimate necessarily kill any advantage we might get from learning about the distribution? It feels like the latter would say something pretty deep (and depressing) about the whole project of observational science...
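
A toy version of the gap between the two kinds of bound, for a single fixed hypothesis rather than a whole class (so this is an illustration only, not a learning bound, and it does not answer the question above). Assume the loss of that hypothesis on each data point lies in $ [0,1] $ and the data are IID. Hoeffding's inequality uses nothing else about the distribution; Bernstein's inequality also uses the variance of the loss, which is treated here as known. The function names are hypothetical.

```python
import math


def hoeffding_width(n: int, delta: float) -> float:
    """Two-sided deviation of the empirical mean from the true mean,
    holding with probability >= 1 - delta, for IID losses in [0, 1].
    Uses no information about the distribution beyond the range."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))


def bernstein_width(n: int, delta: float, variance: float, b: float = 1.0) -> float:
    """Same guarantee from Bernstein's inequality, given the loss variance
    and a bound b on |loss - mean| (b = 1 works for losses in [0, 1]).
    Solves 2 * exp(-n t^2 / (2 variance + 2 b t / 3)) = delta for t."""
    log_term = math.log(2.0 / delta)
    c = b * log_term / (3.0 * n)
    return c + math.sqrt(c * c + 2.0 * variance * log_term / n)


if __name__ == "__main__":
    n, delta = 1000, 0.05
    print("distribution-free (Hoeffding):   ", hoeffding_width(n, delta))
    print("low variance 0.01 (Bernstein):   ", bernstein_width(n, delta, 0.01))
    print("maximal variance 0.25 (Bernstein):", bernstein_width(n, delta, 0.25))
```

With these settings the low-variance Bernstein width comes out at roughly a quarter of the Hoeffding width, while at the largest variance possible for a $ [0,1] $ variable it is slightly wider. The question in the text is what happens to the advantage once the variance (or more of the distribution) has to be estimated from part of the same sample.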

To learn more about: stability-based arguments.

Martin Anthony and Peter C. Bartlett, Neural Network Learning: Theoretical Foundations [The theory is much broader than just neural networks]
Olivier Bousquet, Stéphane Boucheron and Gábor Lugosi, "Introduction to Statistical Learning Theory" [PDF. 39 pp. review on how to bound the error of your learning algorithms.]
Nicolo Cesa-Bianchi and Gabor Lugosi, Prediction, Learning, and Games [Mini-review]
Nello Cristianini and John Shawe-Taylor, An Introduction to Support Vector Machines [While SVMs are one particular technology among others, this book does an excellent job of crisply introducing the general theory of learning, and showing its practicality.]
Michael J. Kearns and Umesh V. Vazirani, An Introduction to Computational Learning Theory [Review: How to Build a Better Guesser]
John Lafferty and Larry Wasserman, Statistical Machine Learning [Unpublished lecture notes]
Mehryar Mohri, Afshin Rostamizadeh and Ameet Talwalkar, Foundations of Machine Learning [Review]
Maxim Raginsky, Statistical Learning Theory [Class webpage, with excellent notes and further readings]
V. N. Vapnik, The Nature of Statistical Learning Theory [Review: A Useful Biased Estimator]
Mathukumalli Vidyasagar, A Theory of Learning and Generalization: With Applications to Neural Networks and Control Systems [Mini-review.]

Alia Abbara, Benjamin Aubin, Florent Krzakala, Lenka Zdeborová, "Rademacher complexity and spin glasses: A link between the replica and statistical theories of learning", arxiv:1912.02729
Terrence M. Adams and Andrew B. Nobel, "Uniform Approximation of Vapnik-Chervonenkis Classes", arxiv:1010.4515
Alekh Agarwal, John C. Duchi, Peter L. Bartlett, Clement Levrard, "Oracle inequalities for computationally budgeted model selection" [COLT 2011]
David Balduzzi
- "Information, learning and falsification", arxiv:1110.3592
- "Falsification and future performance", arxiv:1111.5648
Moulinath Banerjee, "Covering Numbers and VC dimension" [review notes, UM statistics department, 2004; PS]
Peter L. Bartlett and Shahar Mendelson, "Rademacher and Gaussian Complexities: Risk Bounds and Structural Results", Journal of Machine Learning Research 3 (2002): 463--482
Jonathan Baxter, "A Model of Inductive Bias Learning," Journal of Artificial Intelligence Research 12 (2000): 149--198 [How to learn what class of hypotheses you should be trying to use, i.e., your inductive bias. Assumes independence, again.]
Olivier Catoni, Statistical Learning Theory and Stochastic Optimization
A. DeSantis, G. Markowsky, M.N. Wegman, "Learning probabilistic prediction functions", in 29th Annual Symposium on Foundations of Computer Science [FOCS 1988]
Philip D. Laird, "Efficient unsupervised learning", pp. 297--311 in Proceedings of the first annual workshop on Computational Learning Theory [COLT 1988] [Nominally at this URL, but it seems, astonishingly, not to actually be available online anywhere]
Ben London, Bert Huang, Lise Getoor, "Graph-based Generalization Bounds for Learning Binary Relations", arxiv:1302.5348
Gabor Lugosi, Shahar Mendelson and Vladimir Koltchinskii, "A note on the richness of convex hulls of VC classes", Electronic Communications in Probability 8 (2003): 18 ["We prove the existence of a class $ A $ of subsets of $ R^d $ of VC dimension 1 such that the symmetric convex hull $ F $ of the class of characteristic functions of sets in $ A $ is rich in the following sense. For any absolutely continuous probability measure $ m $ on $ R^d $, measurable set $ B $ and $ h >0 $, there exists a function $ f $ in $ F $ such that the measure of the symmetric difference of $ B $ and the set where $ f $ is positive is less than $ h $." The astonishingly simple proof turns on the Borel isomorphism theorem, which says that the Borel sets of any complete, separable metric space are in one-to-one correspondence with the Borel sets of the unit interval.]
Shahar Mendelson, "Learning without Concentration", COLT 2014: 25--39, arxiv:1401.0304
Mircea Petrache and Shubhendu Trivedi, "Approximation-Generalization Trade-offs under (Approximate) Group Equivariance", arxiv:2305.17592 [This is a very nice set of results about when, and to what extent, models predict better when they show the same symmetries as the data-generating distribution (especially when the symmetry is only approximate).]
David Pollard
- "Asymptotics via Empirical Processes", Statistical Science 4 (1989): 341--354
- Empirical Processes: Theory and Applications [Review]
Maxim Raginsky, "Divergence-based characterization of fundamental limitations of adaptive dynamical systems", arxiv:1010.2286
Ali Rahimi and Benjamin Recht, "Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning", NIPS 2008
Alexander Rakhlin, Karthik Sridharan, Ambuj Tewari, "Online Learning: Random Averages, Combinatorial Parameters, and Learnability", arxiv:1006.1138 [This is not an easy paper to read, but the results are quite deep, and it repays the effort.]
Philippe Rigollet and Xin Tong, "Neyman-Pearson classification, convexity and stochastic constraints", Journal of Machine Learning Research 12 (2011): 2831--2855, arxiv:1102.5750
Robert E. Schapire and Yoav Freund, Boosting: Foundations and Algorithms [Review: Weak Learners of the World, Unite!]
C. Scott and R. Nowak, "A Neyman-Pearson Approach to Statistical Learning", IEEE Transactions on Information Theory 51 (2005): 3806--3819 [Comments]
Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro and Karthik Sridharan, "Learnability, Stability and Uniform Convergence", Journal of Machine Learning Research 11 (2010): 2635--2670 [If one looks at a broader domain than the usual supervised regression or classification problems, uniform convergence of risks is not required for learnability, but the existence of a stable learning algorithm is. Of course much turns on definitions here, and I'm not completely certain, after only one reading, that I really buy all of theirs...]
John Shawe-Taylor, Peter L. Bartlett, Robert C. Williamson and Martin Anthony, "Structural Risk Minimization over Data-Dependent Hierarchies", IEEE Transactions on Information Theory 44 (1998): 1926--1940 [There used to be a PDF on Prof. Williamson's website, but apparently no longer]
Nathan Srebro, Karthik Sridharan, Ambuj Tewari, "Smoothness, Low-Noise and Fast Rates", arxiv:1009.3896
J. Michael Steele, The Scary Sequences Project [Notes and references on individual-sequence prediction]
Matus Telgarsky and Sanjoy Dasgupta, "Moment-based Uniform Deviation Bounds for $k$-means and Friends", arxiv:1311.1903
Sara van de Geer, Empirical Processes in M-Estimation
Ramon van Handel, "The universal Glivenko-Cantelli property", Probability Theory and Related Fields 155 (2013): 911--934, arxiv:1009.4434
Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, Oriol Vinyals, "Understanding deep learning (still) requires rethinking generalization", Communications of the ACM 64 (2021): 107--115 [previous version: arxiv:1611.03530]
Xiaojin Zhu, Timothy T. Rogers and Bryan R. Gibson, "Human Rademacher Complexity", in Advances in Neural Information Processing, vol. 22 (NIPS 2009) [my commentary]

Daniel J. McDonald, CRS and Mark Schervish, "Estimated VC Dimension for Risk Bounds", arxiv:1111.3404 [More]
CRS, Lecture notes for 36-465, "Conceptual Foundations of Statistical Learning" [By year; most recently, as I write, 2021]

Yohji Akama, Kei Irie, "VC dimension of ellipsoids", arxiv:1109.4347
Pierre Alquier, "PAC-Bayesian Bounds for Randomized Empirical Risk Minimizers", arxiv:0712.1698
Andris Ambainis, "Probabilistic and Team PFIN-type Learning: General Properties", cs.LG/0504001
Sylvain Arlot, Peter L. Bartlett, "Margin-adaptive model selection in statistical learning", Bernoulli 17 (2011): 687--713, arxiv:0804.2937
Jean-Yves Audibert, "Fast learning rates in statistical inference through aggregation", Annals of Statistics 37 (2009): 1591--1646, math.ST/0703854
Jean-Yves Audibert and Olivier Bousquet, "Combining PAC-Bayesian and Generic Chaining Bounds", Journal of Machine Learning Research 8 (2007): 863--889
Francis Bach, "Sharp analysis of low-rank kernel matrix approximations", arxiv:1208.2015
Krishnakumar Balasubramanian, Pinar Donmez, Guy Lebanon, "Unsupervised Supervised Learning II: Training Margin Based Classifiers without Labels", arxiv:1003.0470
Peter L. Bartlett, Olivier Bousquet, Shahar Mendelson, "Local Rademacher complexities", Annals of Statistics 33 (2005): 1497--1537, arxiv:math/0508275
Andrew R. Barron, Albert Cohen, Wolfgang Dahmen, Ronald A. DeVore, "Approximation and learning by greedy algorithms", Annals of Statistics 36 (2008): 64--94, arxiv:0803.1718
José Bento and Andrea Montanari, "Which graphical models are difficult to learn?", arxiv:0910.5761
G. Biau and L. Devroye, "A Note on Density Model Size Testing", IEEE Transactions on Information Theory 50 (2004): 576--581 [Testing which member of a nested family of model classes you need to use]
Gilles Blanchard, Olivier Bousquet, Pascal Massart, "Statistical performance of support vector machines", Annals of Statistics 36 (2008): 489--531, arxiv:0804.0551
Gilles Blanchard and Francois Fleuret, "Occam's hammer: a link between randomized learning and multiple testing FDR control", math.ST/0608713
Adam Block, Alexander Rakhlin, Abhishek Shetty, "On the Performance of Empirical Risk Minimization with Smoothed Data", arxiv:2402.14987
Olivier Bousquet, Yegor Klochkov, Nikita Zhivotovskiy, "Sharper bounds for uniformly stable algorithms", arxiv:1910.07833
Christian Brownlees, Emilien Joly, Gabor Lugosi, "Empirical risk minimization for heavy-tailed losses", arxiv:1406.2462
Arnaud Buhot and Mirta B. Gordon, "Storage Capacity of the Tilinglike Learning Algorithm," cond-mat/0008162
Arnaud Buhot, Mirta B. Gordon and Jean-Pierre Nadal, "Rigorous Bounds to Retarded Learning," cond-mat/0201256
Stephane Canu, Xavier Mary, Alain Rakotomamonjy, "Functional learning through kernels", arxiv:0910.1013
Andrea Caponnetto and Alexander Rakhlin, "Stability Properties of Empirical Risk Minimization over Donsker Classes", Journal of Machine Learning Research 7 (2006): 2565--2583
Olivier Catoni
- "Improved Vapnik Cervonenkis bounds", math.ST/0410280
- Pac-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning [Full-text open access]
Kamalika Chaudhuri, Claire Monteleoni, Anand D. Sarwate, "Differentially Private Empirical Risk Minimization", Journal of Machine Learning Research 12 (2011): 1069--1109
Olivier Chapelle, Bernhard Schölkopf and Alexander Zien (eds.), Semi-Supervised Learning
Yifeng Chu, Maxim Raginsky, "A unified framework for information-theoretic generalization bounds", arxiv:2305.11042
Thomas M. Cover, "Geometrical and Statistical Properties of Systems of Linear Inequalities with Applications in Pattern Recognition", IEEE Transactions on Electronic Computers EC-14 (1965): 326--334 [Via rvenkat, who points out that this is a remarkable anticipation of VC theory]
Toby S. Cubitt, Jens Eisert, Michael M. Wolf, "Extracting dynamical equations from experimental data is NP-hard", Physical Review Letters 108 (2012): 120503, arxiv:1005.0005
Felipe Cucker and Ding Xuan Zhou, Learning Theory: An Approximation Theory Viewpoint
Arnak Dalalyan and Alexandre B. Tsybakov, "Sparse Regression Learning by Aggregation and Langevin Monte-Carlo", arxiv:0903.1223
Amit Daniely, Shai Shalev-Shwartz, "Optimal learners for multiclass problems", COLT 2014: 287--316
Ofir David, Shay Moran, Amir Yehudayoff, "On statistical learning via the lens of compression", NeurIPS 2016 pp. 2792--2800
C. De Mol, E. De Vito, L. Rosasco, "Elastic-Net Regularization in Learning Theory", arxiv:0807.3423
Joshua V Dillon, Krishnakumar Balasubramanian, Guy Lebanon, "Asymptotic Analysis of Generative Semi-Supervised Learning", arxiv:1003.0024
Kefan Dong, Tengyu Ma, "Toward $ L_{\infty} $-recovery of Nonlinear Functions: A Polynomial Sample Complexity Bound for Gaussian Random Fields", arxiv:2305.00322 [This sounds interesting, but it also seems to say that Gaussian random fields generate especially smooth functions (with high probability), casting doubt on their suitability as a general prior. (Of course I think there are no general priors.)]
Weinan E, Chao Ma, Lei Wu, "The Generalization Error of the Minimum-norm Solutions for Over-parameterized Neural Networks", Pure and Applied Functional Analysis 5 (2020): 1145--1460, arxiv:1912.06987
P. P. B. Eggermont, V. N. LaRiccia, "Uniform error bounds for smoothing splines", arxiv:math/0612776
Vitaly Feldman, "Robustness of Evolvability" [PDF preprint]
Majid Fozunbal, Ton Kalker, "Decision Making with Side Information and Unbounded Loss Functions", arxiv:cs/0601115
Yoav Freund, Yishay Mansour and Robert E. Schapire, "Generalization bounds for averaged classifiers", Annals of Statistics 32 (2004): 1698--1722, math.ST/0410092
Elisabeth Gassiat, Ramon Van Handel, "The local geometry of finite mixtures", arxiv:1202.3482
Servane Gey, "Vapnik-Chervonenkis Dimension of Axis-Parallel Cuts", arxiv:1203.0193
Carl Gold and Peter Sollich, "Model Selection for Support Vector Machine Classification," cond-mat/0203334
Noah Golowich, Ankur Moitra, Dhruv Rohatgi, "Exploration is Harder than Prediction: Cryptographically Separating Reinforcement Learning from Supervised Learning", arxiv:2404.03774
Lee-Ad Gottlieb, Leonid Kontorovich, Elchanan Mossel, "VC bounds on the cardinality of nearly orthogonal function classes", arxiv:1007.4915
Adityanand Guntuboyina, Bodhisattva Sen, "Covering Numbers for Convex Functions", IEEE Transactions on Information Theory 59 (2013): 1957--1965 arxiv:1304.0147
Steve Hanneke
- "Teaching Dimension and the Complexity of Active Learning" [PDF]
- "A Bound on the Label Complexity of Agnostic Active Learning" [PDF]
- "The Cost Complexity of Active Learning" [PDF]
- "Rates of convergence in active learning", Annals of Statistics 39 (2011): 333--361
- "Theory of Disagreement-Based Active Learning", Foundations and Trends in Machine Learning 7 (2014): 131--309
Steve Hanneke, Aryeh Kontorovich, Guy Kornowski, "Efficient Agnostic Learning with Average Smoothness", arxiv:2309.17016
Nancy Heckman, "The theory and application of penalized methods or Reproducing Kernel Hilbert Spaces made easy", arxiv:1111.1915
Fredrik Hellström, Giuseppe Durisi, Benjamin Guedj, Maxim Raginsky, "Generalization Bounds: Perspectives from Information Theory and PAC-Bayes", arxiv:2309.04381
Daniel J. L. Herrmann and Dominik Janzing, "Selection Criterion for Log-Linear Models Using Statistical Learning Theory", math.ST/0302079
Don Hush, Clint Scovel and Ingo Steinwart, "Stability of Unstable Learning Algorithms", Machine Learning 67 (2007): 197--206
Sanjay Jain, Daniel Osherson, James S. Royer and Arun Sharma, Systems That Learn: An Introduction to Learning Theory
Sham M. Kakade, Shai Shalev-Shwartz, Ambuj Tewari, "Regularization Techniques for Learning with Matrices", Journal of Machine Learning Research 13 (2012): 1865--1890
Yuri Kalnishkan, Volodya Vovk and Michael V. Vyugin, "Loss functions, complexities, and the Legendre transformation", Theoretical Computer Science 313 (2004): 195--207
Gerard Kerkyacharian and Dominique Picard, "Thresholding in Learning Theory", math.ST/0510271
Jussi Klemela and Enno Mammen, "Empirical risk minimization in inverse problems", Annals of Statistics 38 (2010): 482--511
Michael Kohler, Adam Krzyzak, "Over-parametrized deep neural networks do not generalize well", arxiv:1912.03925
Vladimir Koltchinskii, Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems
Natalia Komarova and Igor Rivin, "Mathematics of learning," math.PR/0105235
Samuel Kutin and Partha Niyogi, "Almost-everywhere algorithmic stability and generalization error", UAI 2002, arxiv:1301.0579
Pirkko Kuusela, Daniel Ocone and Eduardo D. Sontag, "Learning Complexity Dimensions for a Continuous-Time Control System," math.OC/0012163
John Langford, "Tutorial on Practical Prediction Theory for Classification", Journal of Machine Learning Research 6 (2005): 273--306 [For the PAC-Bayesian result]
Kasper Green Larsen, "Bagging is an Optimal PAC Learner", arxiv:2212.02264
A. Lecchini-Visintini, J. Lygeros, J. Maciejowski, "Simulated Annealing: Rigorous finite-time guarantees for optimization on continuous domains", arxiv:0709.2989
Guillaume Lecué
- "Simultaneous adaptation to the margin and to complexity in classification", math.ST/0509696
- "Empirical risk minimization is optimal for the convex aggregation problem", Bernoulli 19 (2013): 2153--2166
Guillaume Lecué and Shahar Mendelson
- "Aggregation via empirical risk minimization", Probability Theory and Related Fields 145 (2009): 591--613
- "Sharper lower bounds on the performance of the empirical risk minimization algorithm", Bernoulli 16 (2010): 605--613, arxiv:1102.4983
- "General nonexact oracle inequalities for classes with a subexponential envelope", Annals of Statistics 40 (2012): 832--860
- "Learning subgaussian classes : Upper and minimax bounds", arxiv:1305.4825
Feng Liang, Sayan Mukherjee, Mike West, "The Use of Unlabeled Data in Predictive Modeling", arxiv:0710.4618 = Statistical Science 22 (2007): 189--205
Nick Littlestone and Manfred K. Warmuth, "Relating Data Compression and Learnability", 1986 preprint?
Djamal Louani, Sidi Mohamed Ould Maouloud, "Large Deviation Results for the Nonparametric Regression Function Estimator on Functional Data", arxiv:1111.5989
Gabor Lugosi and Marten Wegkamp, "Complexity regularization via localized random penalties", Annals of Statistics 32 (2004): 1679--1697 = math.ST/0410091
Malik Magdon-Ismail, "Permutation Complexity Bound on Out-Sample Error", NIPS 2010 [PDF reprint]
Dörthe Malzahn and Manfred Opper, "A statistical physics approach for the analysis of machine learning algorithms on real data", Journal of Statistical Mechanics: Theory and Experiment (2005): P11001
Pascal Massart and Élodie Nédélec, "Risk bounds for statistical learning", math.ST/0702683 = Annals of Statistics 34 (2006): 2326--2366
Andreas Maurer, "A Note on the PAC Bayesian Theorem", cs.LG/0411099
Andreas Maurer, Massimiliano Pontil, "Structured Sparsity and Generalization", Journal of Machine Learning Research 13 (2012): 671--690
Shahar Mendelson
- "Improving the Sample Complexity Using Global Data", IEEE Transactions on Information Theory 48 (2002): 1977--1991
- "Lower Bounds for the Empirical Minimization Algorithm", IEEE Transactions on Information Theory 54 (2008): 3797--3803 ["simple argument ... that under mild geometric assumptions .... the empirical minimization algorithm cannot yield a uniform error rate that is faster than $1/\sqrt{k}$ in the function learning setup".]
Shahar Mendelson and Gideon Schechtman, "The shattering dimension of sets of linear functionals", Annals of Probability 32 (2004): 1746--1770 = math.PR/0410096
P. Mitra, C. A. Murthy and S. K. Pal, "A probabilistic active support vector learning algorithm", IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (2004): 413--418
Sayan Mukherjee, Partha Niyogi, Tomaso Poggio and Ryan Rifkin, "Learning theory: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization", Advances in Computational Mathematics 25 (2006): 161--193 [PDF. Thanks to Shiva Kaul for pointing this out to me.]
Boaz Nadler, "Finite sample approximation results for principal component analysis: a matrix perturbation approach", Annals of Statistics 36 (2008): 2791--2817, arxiv:0901.3245
Roland Nilsson, Jose M. Pena, Johan Bjorkegren and Jesper Tegner, "Consistent Feature Selection for Pattern Recognition in Polynomial Time", Journal of Machine Learning Research 8 (2007): 589--612
E. Parrado-Hernandez, I. Mora-Jimenez, J. Arenas-Garcia, A. R. Figueiras-Vidal and A. Navia-Vazquez, "Growing support vector classifiers with controlled complexity," Pattern Recognition Letters 36 (2003): 1479--1488
Vladimir Pestov
- "PAC learnability of a concept class under non-atomic measures: a problem by Vidyasagar", arxiv:1006.5090
- "PAC learnability versus VC dimension: a footnote to a basic result of statistical learning", arxiv:1104.2097
Maxim Raginsky, "Achievability results for statistical learning under communication constraints", arxiv:0901.1905
Alexander Rakhlin and Karthik Sridharan, "Statistical Learning Theory and Sequential Prediction" [PDF lecture notes]
Liva Ralaivola, Marie Szafranski, Guillaume Stempfel, "Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary $\beta$-Mixing Processes", arxiv:0909.1933
Sebastian Risau-Gusman and Mirta B. Gordon
- "Generalization Properties of Finite Size Polynomial Support Vector Machines," cond-mat/0002071 =? "Hierarchical Learning in Polynomial Support Vector Machines," Machine Learning 46 (2002): 53--70
- "Learning curves for Soft Margin Classifiers," cond-mat/0203315
Igor Rivin, "The performance of the batch learner algorithm," cs.LG/0201009
Dana Ron, "Property Testing: A Learning Theory Perspective", Foundations and Trends in Machine Learning 1 (2008): 307--402 [PDF preprint]
Benjamin I. P. Rubinstein, Aleksandr Simma, "On the Stability of Empirical Risk Minimization in the Presence of Multiple Risk Minimizers", IEEE Transactions on Information Theory 58 (2012): 4160--4163, arxiv:1002.2044
J. Hyam Rubinstein, Benjamin I. P. Rubinstein, Peter L. Bartlett, "Bounding Embeddings of VC Classes into Maximum Classes", arxiv:1401.7388
Lorenzo Rosasco, Ernesto De Vito, Andrea Caponnetto, Michele Piana and Alessandro Verri, "Are Loss Functions All the Same?", Neural Computation 16 (2004): 1063--1076
Cynthia Rudin, "Stability Analysis for Regularized Least Squares Regression", cs.LG/0502016
Sivan Sabato, Nathan Srebro, Naftali Tishby, "Distribution-Dependent Sample Complexity of Large Margin Learning", Journal of Machine Learning Research 14 (2013): 2119--2149
Narayana Santhanam, Venkatachalam Anantharam, Wojciech Szpankowski, "Data-Derived Weak Universal Consistency", Journal of Machine Learning Research 23 (2022): 27
Tom Schaul, Sixin Zhang, Yann LeCun, "No More Pesky Learning Rates", arxiv:1206.1106
Milad Sefidgaran, Abdellatif Zaidi, "Data-dependent Generalization Bounds via Variable-Size Compressibility", arxiv:2303.05369
Yevgeny Seldin, Nicolò Cesa-Bianchi, François Laviolette, Peter Auer, John Shawe-Taylor, Jan Peters, "PAC-Bayesian Analysis of the Exploration-Exploitation Trade-off", arxiv:1105.4585
Xuhui Shao, Vladimir Cherkassky and William Li, "Measuring the VC-Dimension Using Optimized Experimental Design," Neural Computation 12 (2000): 1969--1986
Xiaotong Shen and Lifeng Wang, "Generalization error for multi-class margin classification", Electronic Journal of Statistics 1 (2007): 307--330, arxiv:0708.3556
I. Steinwart, "Consistency of Support Vector Machines and Other Regularized Kernel Classifiers", IEEE Transactions on Information Theory 51 (2005): 128--142
Ingo Steinwart and Andreas Christmann, Support Vector Machines
Joe Suzuki, "On Strong Consistency of Model Selection in Classification", IEEE Transactions on Information Theory 52 (2006): 4767--4774 [Based on information-theoretic criteria]
Matus Telgarsky, "Stochastic linear optimization never overfits with quadratically-bounded losses on general data", arxiv:2202.06915
Sara A. van de Geer
- "On non-asymptotic bounds for estimation in generalized linear models with highly correlated design", arxiv:0709.0844
- "High-dimensional generalized linear models and the lasso", Annals of Statistics 36 (2008): 614--645 = arxiv:0804.0703
Aad van der Vaart and Jon A. Wellner, "A note on bounds for VC dimensions", pp. 103--107 in Christian Houdré, Vladimir Koltchinskii, David M. Mason and Magda Peligrad (eds.), High Dimensional Probability V: The Luminy Volume ["We provide bounds for the VC dimension of class of sets formed by unions, intersections, and products of VC classes of sets". Open access]
V. Vapnik and O. Chapelle, "Bounds on Error Expectation for Support Vector Machines," Neural Computation 12 (2000): 2013--2036
Silvia Villa, Lorenzo Rosasco, Tomaso Poggio, "On Learnability, Complexity and Stability", arxiv:1303.5976
Ulrike von Luxburg, Bernhard Schoelkopf, "Statistical Learning Theory: Models, Concepts, and Results", arxiv:0810.4752
Yu-Xiang Wang, Huan Xu, "Stability of matrix factorization for collaborative filtering", ICML 2012, arxiv:1206.4640
Huan Xu and Shie Mannor, "Robustness and Generalization", Machine Learning 86 (2012): 391--423 arxiv:1005.2243
Yuhong Yang and Andrew Barron, "Information-theoretic determination of minimax rates of convergence", Annals of Statistics 27 (1999): 1564--1599
Kai Yu and Tong Zhang, "High Dimensional Nonlinear Learning using Local Coordinate Coding", arxiv:0906.5190
Alon Zakai and Ya'acov Ritov, "Consistency and Localizability", Journal of Machine Learning Research 10 (2009): 827--856
Chao Zhang and Dacheng Tao, "Generalization Bound for Infinitely Divisible Empirical Process", Journal of Machine Learning Research Workshops and Conference Proceedings 15 (2011): 864--872
Tong Zhang
- "Covering Number Bounds of Certain Regularized Linear Function Classes", Journal of Machine Learning Research 2 (2002): 527--550
- "Learning Bounds for Kernel Regression Using Effective Data Dimensionality", Neural Computation 17 (2005): 2077--2098
- "Information-Theoretic Upper and Lower Bounds for Statistical Estimation", IEEE Transactions on Information Theory 52 (2006): 1307--1321
Ying Zhu, "Phase transitions in nonparametric regressions: a curse of exploiting higher degree smoothness assumptions in finite samples", arxiv:2112.03626
Or Zuk, Shiri Margel, Eytan Domany, "On the Number of Samples Needed to Learn the Correct Structure of a Bayesian Network", UAI 2006, arxiv:1206.6862