Notebooks

## Learning Theory (Formal, Computational or Statistical)

21 Sep 2022 16:00

I qualify it to distinguish this area from the broader field of machine learning, which includes much more, with lower standards of proof, and from the theory of learning in organisms, which might be quite different.

The basic set-up is as follows. We have a bunch of inputs and outputs, and an unknown relationship between the two. We do have a class of hypotheses describing this relationship, and we suppose that one of them is correct. (The hypothesis class is always circumscribed, but may be infinite.) A learning algorithm takes in a set of inputs and outputs, its data, and produces a hypothesis. Generally we assume the data are generated by some random process, and the hypothesis changes as the data change. The key notion is that of a probably approximately correct (PAC) learning algorithm --- one where, if we supply enough data, we can get a hypothesis with an arbitrarily small error, with a probability arbitrarily close to one.
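To make the set-up concrete, here is a toy sketch (my own illustration; the inputs, target rule, and numbers are all invented for the example): inputs uniform on [0, 1], the true rule a threshold at 0.5, the hypothesis class a grid of thresholds, and the learner empirical risk minimization.

```python
import random

def true_label(x):
    # The unknown true relationship: positive iff x >= 0.5.
    return x >= 0.5

def erm_threshold(data, grid_size=100):
    """Empirical risk minimization: return the grid threshold
    with the fewest mistakes on the training data."""
    grid = [k / grid_size for k in range(grid_size + 1)]
    def train_error(t):
        return sum((x >= t) != y for x, y in data)
    return min(grid, key=train_error)

def generalization_error(t):
    # For uniform inputs, the true error of threshold t is just
    # the measure of the region where it disagrees with 0.5.
    return abs(t - 0.5)

random.seed(0)
n, trials, eps = 200, 100, 0.05
failures = 0
for _ in range(trials):
    data = [(x, true_label(x)) for x in (random.random() for _ in range(n))]
    if generalization_error(erm_threshold(data)) > eps:
        failures += 1

# "Probably approximately correct": the fraction of runs where the
# learned rule errs by more than eps should be very small.
print(failures / trials)
```

Running this, the fraction of trials where the learned hypothesis has error above eps is essentially zero: the algorithm is probably (over the random draw of data) approximately (to within eps) correct.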

Generally, PAC results concern (1) the existence of a PAC algorithm, (2) quantifying how much data we need to achieve a given accuracy and reliability, or (3) devising new PAC algorithms with other desirable properties. What frustrates me about this literature, and the reason I don't devote more of my research to it (aside, of course, from my sheer incompetence), is that almost all of it assumes the data are statistically independent and identically distributed. Then PAC-like results follow essentially from extensions of the ordinary Law of Large Numbers. What's really needed, however, is something more like an ergodic theorem, for suitably-dependent data. That, however, gets its own notebook.
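For finite hypothesis classes, the flavor of these sample-complexity results is easy to exhibit: Hoeffding's inequality plus a union bound gives uniform convergence of empirical to true error rates, and hence a PAC guarantee for empirical risk minimization. A back-of-the-envelope calculation (the particular numbers are illustrative, not from any of the references):

```python
import math

def pac_sample_size(num_hypotheses, eps, delta):
    """Samples sufficient so that, with probability at least 1 - delta,
    every hypothesis's empirical error is within eps of its true error:
    m >= ln(2|H| / delta) / (2 eps^2), by Hoeffding plus a union bound."""
    return math.ceil(math.log(2 * num_hypotheses / delta) / (2 * eps ** 2))

# The dependence on the size of the class is only logarithmic; accuracy is
# the expensive part, since halving eps quadruples the data requirement.
print(pac_sample_size(1000, eps=0.05, delta=0.01))
print(pac_sample_size(10 ** 6, eps=0.05, delta=0.01))
```

Note that multiplying the class size by a thousand adds only about fifty percent to the required sample, while the bound makes no assumptions at all about the input distribution --- which is exactly what makes it distribution-free, and exactly why it can be loose.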

An interesting question (which I learned of from Vidyasagar's book) has to do with the difference between distribution-free and distribution-dependent bounds. Generally, the latter are sharper, sometimes much sharper, but this comes at the price of making more or less strong parametric assumptions about the distribution. (One might indeed think of the theory of parametric statistical inference as learning theory with very strong distributional assumptions.) However, even in the distribution-free set-up, we have a whole bunch of samples from the distribution, and non-parametric density estimation is certainly possible --- could one, e.g., improve the bounds by using half the sample to estimate the distribution, and then applying a distribution-dependent bound? Or will the uncertainty in the distributional estimate necessarily kill any advantage we might get from learning about the distribution? It feels like the latter would say something pretty deep (and depressing) about the whole project of observational science...
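The gap at issue shows up already in the simplest possible case, bounding the deviation of a single empirical mean: the distribution-free Hoeffding bound ignores the variance, while a Bernstein-type bound exploits it. A toy comparison (my own numbers, and it cheats by assuming the variance is known exactly rather than estimated):

```python
import math

def hoeffding_bound(m, delta):
    """Deviation bound for the mean of m i.i.d. [0,1]-valued samples,
    holding for *any* distribution, at confidence 1 - delta."""
    return math.sqrt(math.log(2 / delta) / (2 * m))

def bernstein_bound(m, delta, variance):
    """Bernstein-type deviation bound using the (assumed known) variance;
    much tighter when the variance is small."""
    log_term = math.log(2 / delta)
    return math.sqrt(2 * variance * log_term / m) + 2 * log_term / (3 * m)

m, delta = 1000, 0.05
print(hoeffding_bound(m, delta))        # distribution-free
print(bernstein_bound(m, delta, 0.01))  # low-variance distribution
```

With these numbers the variance-aware bound is roughly four times tighter; whether estimating the variance from half the sample necessarily eats up that kind of advantage is precisely the question raised above.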

Recommended, bigger picture:
• Martin Anthony and Peter L. Bartlett, Neural Network Learning: Theoretical Foundations [The theory is much broader than just neural networks]
• Olivier Bousquet, Stéphane Boucheron and Gábor Lugosi, "Introduction to Statistical Learning Theory" [PDF. 39 pp. review on how to bound the error of your learning algorithms.]
• Nicolò Cesa-Bianchi and Gábor Lugosi, Prediction, Learning, and Games [Mini-review]
• Nello Cristianini and John Shawe-Taylor, An Introduction to Support Vector Machines [While SVMs are one particular technology among others, this book does an excellent job of crisply introducing the general theory of learning, and showing its practicality.]
• Michael J. Kearns and Umesh V. Vazirani, An Introduction to Computational Learning Theory [Review: How to Build a Better Guesser]
• John Lafferty and Larry Wasserman, Statistical Machine Learning [Unpublished lecture notes]
• Mehryar Mohri, Afshin Rostamizadeh and Ameet Talwalkar, Foundations of Machine Learning [Review]
• Maxim Raginsky, Statistical Learning Theory [Class webpage, with excellent notes and further readings]
• V. N. Vapnik, The Nature of Statistical Learning Theory [Review: A Useful Biased Estimator]
• Mathukumalli Vidyasagar, A Theory of Learning and Generalization: With Applications to Neural Networks and Control Systems [Mini-review.]
Recommended, close-ups (very misc.):
• Terrence M. Adams and Andrew B. Nobel, "Uniform Approximation of Vapnik-Chervonenkis Classes", arxiv:1010.4515
• Alekh Agarwal, John C. Duchi, Peter L. Bartlett, Clement Levrard, "Oracle inequalities for computationally budgeted model selection" [COLT 2011]
• David Balduzzi
• Moulinath Banerjee, "Covering Numbers and VC dimension" [review notes, UM statistics department, 2004; PS]
• Peter L. Bartlett and Shahar Mendelson, "Rademacher and Gaussian Complexities: Risk Bounds and Structural Results", Journal of Machine Learning Research 3 (2002): 463--482
• Jonathan Baxter, "A Model of Inductive Bias Learning," Journal of Artificial Intelligence Research 12 (2000): 149--198 [How to learn what class of hypotheses you should be trying to use, i.e., your inductive bias. Assumes independence, again.]
• Mikhail Belkin, "Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation", arxiv:2105.14368
• Olivier Bousquet and André Elisseeff, "Stability and Generalization", Journal of Machine Learning Research 2 (2002): 499--526 [A very cool paper, resting on the idea that if your learning algorithm is stable under small perturbations of the data, and you have a lot of data, then how well it predicts the training data has to be a good estimate of how well it will predict future data. Clearly related to Pedro Domingos's "process-oriented evaluation", which they don't mention, but more formal.]
• Olivier Catoni, Statistical Learning Theory and Stochastic Optimization [Mini-review]
• N. Cesa-Bianchi, A. Conconi and C. Gentile, "On the Generalization Ability of On-Line Learning Algorithms", IEEE Transactions on Information Theory 50 (2004): 2050--2057
• Corinna Cortes, Spencer Greenberg, Mehryar Mohri, "Relative Deviation Learning Bounds and Generalization with Unbounded Loss Functions", arxiv:1310.5796
• Corinna Cortes, Yishay Mansour and Mehryar Mohri, "Learning Bounds for Importance Weights", NIPS 23 (2010) [From my point of view, the most interesting part of this is actually the very elegant generalization bound for unbounded loss functions, developed in the supplemental materials, and now followed up by Cortes, Greenberg and Mohri. Comments.]
• Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, Aaron Roth, "Preserving Statistical Validity in Adaptive Data Analysis", arxiv:1411.2664
• Lee-Ad Gottlieb, Aryeh Kontorovich, Robert Krauthgamer, "Efficient Regression in Metric Spaces via Approximate Lipschitz Extension", arxiv:1111.4470
• Hrayr Harutyunyan, Maxim Raginsky, Greg Ver Steeg, Aram Galstyan, "Information-theoretic generalization bounds for black-box learning algorithms", forthcoming in NeurIPS 2021, arxiv:2110.01584
• Ralf Herbrich, Robert C. Williamson, "Algorithmic Luckiness", Journal of Machine Learning Research 3 (2002): 175--212
• Michael Jordan, Theoretical Statistics 210B [Link is to the Spring 2008 version of the course, which is the latest I can find right now.]
• Michael J. Kearns and Dana Ron, "Algorithmic stability and sanity-check bounds for leave-one-out cross-validation", Neural Computation 11 (1999): 1427--1453 [PDF preprint via Prof. Kearns]
• Aryeh Kontorovich, "Concentration in unbounded metric spaces and algorithmic stability", arxiv:1309.1007
• Ben London, Bert Huang, Lise Getoor, "Graph-based Generalization Bounds for Learning Binary Relations", arxiv:1302.5348
• Gabor Lugosi, Shahar Mendelson and Vladimir Koltchinskii, "A note on the richness of convex hulls of VC classes", Electronic Communications in Probability 8 (2003): 18 ["We prove the existence of a class $A$ of subsets of $R^d$ of VC dimension 1 such that the symmetric convex hull $F$ of the class of characteristic functions of sets in $A$ is rich in the following sense. For any absolutely continuous probability measure $m$ on $R^d$, measurable set $B$ and $h >0$, there exists a function $f$ in $F$ such that the measure of the symmetric difference of $B$ and the set where $f$ is positive is less than $h$." The astonishingly simple proof turns on the Borel isomorphism theorem, which says that the Borel sets of any complete, separable metric space are in one-to-one correspondence with the Borel sets of the unit interval.]
• Shahar Mendelson, "Learning without Concentration", COLT 2014: 25--39, arxiv:1401.0304
• David Pollard
• "Asymptotics via Empirical Processes", Statistical Science 4 (1989): 341--354
• Empirical Processes: Theory and Applications [Review]
• Maxim Raginsky, "Divergence-based characterization of fundamental limitations of adaptive dynamical systems", arxiv:1010.2286
• Ali Rahimi and Benjamin Recht, "Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning", NIPS 2008
• Alexander Rakhlin, Karthik Sridharan, Ambuj Tewari, "Online Learning: Random Averages, Combinatorial Parameters, and Learnability", arxiv:1006.1138 [This is not an easy paper to read, but the results are quite deep, and it repays the effort.]
• Philippe Rigollet and Xin Tong, "Neyman-Pearson classification, convexity and stochastic constraints", Journal of Machine Learning Research 12 (2011): 2831--2855, arxiv:1102.5750
• Robert E. Schapire and Yoav Freund, Boosting: Foundations and Algorithms [Review: Weak Learners of the World, Unite!]
• C. Scott and R. Nowak, "A Neyman-Pearson Approach to Statistical Learning", IEEE Transactions on Information Theory 51 (2005): 3806--3819 [Comments]
• Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro and Karthik Sridharan, "Learnability, Stability and Uniform Convergence", Journal of Machine Learning Research 11 (2010): 2635--2670 [If one looks at a broader domain than the usual supervised regression or classification problems, uniform convergence of risks is not required for learnability, but the existence of a stable learning algorithm is. Of course much turns on definitions here, and I'm not completely certain, after only one reading, that I really buy all of theirs...]
• John Shawe-Taylor, Peter L. Bartlett, Robert C. Williamson and Martin Anthony, "Structural Risk Minimization over Data-Dependent Hierarchies", IEEE Transactions on Information Theory 44 (1998): 1926--1940 [There used to be a PDF on Prof. Williamson's website, but apparently no longer]
• Nathan Srebro, Karthik Sridharan, Ambuj Tewari, "Smoothness, Low-Noise and Fast Rates", arxiv:1009.3896
• J. Michael Steele, The Scary Sequences Project [Notes and references on individual-sequence prediction]
• Matus Telgarsky and Sanjoy Dasgupta, "Moment-based Uniform Deviation Bounds for $k$-means and Friends", arxiv:1311.1903
• Sara van de Geer, Empirical Processes in M-Estimation
• Ramon van Handel, "The universal Glivenko-Cantelli property", Probability Theory and Related Fields 155 (2013): 911--934, arxiv:1009.4434
• Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, Oriol Vinyals, "Understanding deep learning (still) requires rethinking generalization", Communications of the ACM 64 (2021): 107--115 [previous version: arxiv:1611.03530]
• Xiaojin Zhu, Timothy T. Rogers and Bryan R. Gibson, "Human Rademacher Complexity", in Advances in Neural Information Processing, vol. 22 (NIPS 2009) [my commentary]
Modesty forbids me to recommend:
• Daniel J. McDonald, CRS and Mark Schervish, "Estimated VC Dimension for Risk Bounds", arxiv:1111.3404 [More]
• Yohji Akama, Kei Irie, "VC dimension of ellipsoids", arxiv:1109.4347
• Pierre Alquier, "PAC-Bayesian Bounds for Randomized Empirical Risk Minimizers", arxiv:0712.1698
• Andris Ambainis, "Probabilistic and Team PFIN-type Learning: General Properties", cs.LG/0504001
• Sylvain Arlot, Peter L. Bartlett, "Margin-adaptive model selection in statistical learning", Bernoulli 17 (2011): 687--713, arxiv:0804.2937
• Jean-Yves Audibert, "Fast learning rates in statistical inference through aggregation", Annals of Statistics 37 (2009): 1591--1646, math.ST/0703854
• Jean-Yves Audibert and Olivier Bousquet, "Combining PAC-Bayesian and Generic Chaining Bounds", Journal of Machine Learning Research 8 (2007): 863--889
• Francis Bach, "Sharp analysis of low-rank kernel matrix approximations", arxiv:1208.2015
• Krishnakumar Balasubramanian, Pinar Donmez, Guy Lebanon, "Unsupervised Supervised Learning II: Training Margin Based Classifiers without Labels", arxiv:1003.0470
• Peter L. Bartlett, Olivier Bousquet, Shahar Mendelson, "Local Rademacher complexities", Annals of Statistics 33 (2005): 1497--1537, arxiv:math/0508275
• Andrew R. Barron, Albert Cohen, Wolfgang Dahmen, Ronald A. DeVore, "Approximation and learning by greedy algorithms", Annals of Statistics 36 (2008): 64--94, arxiv:0803.1718
• José Bento and Andrea Montanari, "Which graphical models are difficult to learn?", arxiv:0910.5761
• G. Biau and L. Devroye, "A Note on Density Model Size Testing", IEEE Transactions on Information Theory 50 (2004): 576--581 [Testing which member of a nested family of model classes you need to use]
• Gilles Blanchard, Olivier Bousquet, Pascal Massart, "Statistical performance of support vector machines", Annals of Statistics 36 (2008): 489--531, arxiv:0804.0551
• Gilles Blanchard and Francois Fleuret, "Occam's hammer: a link between randomized learning and multiple testing FDR control", math.ST/0608713
• Olivier Bousquet, Yegor Klochkov, Nikita Zhivotovskiy, "Sharper bounds for uniformly stable algorithms", arxiv:1910.07833
• Christian Brownlees, Emilien Joly, Gabor Lugosi, "Empirical risk minimization for heavy-tailed losses", arxiv:1406.2462
• Arnaud Buhot and Mirta B. Gordon, "Storage Capacity of the Tilinglike Learning Algorithm," cond-mat/0008162
• Arnaud Buhot, Mirta B. Gordon and Jean-Pierre Nadal, "Rigorous Bounds to Retarded Learning," cond-mat/0201256
• Stephane Canu, Xavier Mary, Alain Rakotomamonjy, "Functional learning through kernels", arxiv:0910.1013
• Andrea Caponnetto and Alexander Rakhlin, "Stability Properties of Empirical Risk Minimization over Donsker Classes", Journal of Machine Learning Research 7 (2006): 2565--2583
• Olivier Catoni
• Kamalika Chaudhuri, Claire Monteleoni, Anand D. Sarwate, "Differentially Private Empirical Risk Minimization", Journal of Machine Learning Research 12 (2011): 1069--1109
• Olivier Chapelle, Bernhard Schölkopf and Alexander Zien (eds.), Semi-Supervised Learning
• Lin Chen, Yifei Min, Mikhail Belkin, Amin Karbasi, "Multiple Descent: Design Your Own Generalization Curve", arxiv:2008.01036
• Toby S. Cubitt, Jens Eisert, Michael M. Wolf, "Extracting dynamical equations from experimental data is NP-hard", Physical Review Letters 108 (2012): 120503, arxiv:1005.0005
• Felipe Cucker and Ding Xuan Zhou, Learning Theory: An Approximation Theory Viewpoint
• Arnak Dalalyan and Alexandre B. Tsybakov, "Sparse Regression Learning by Aggregation and Langevin Monte-Carlo", arxiv:0903.1223
• Amit Daniely, Shai Shalev-Shwartz, "Optimal learners for multiclass problems", COLT 2014: 287--316
• Ofir David, Shay Moran, Amir Yehudayoff, "On statistical learning via the lens of compression", NeurIPS 2016 pp. 2792--2800
• C. De Mol, E. De Vito, L. Rosasco, "Elastic-Net Regularization in Learning Theory", arxiv:0807.3423
• Joshua V Dillon, Krishnakumar Balasubramanian, Guy Lebanon, "Asymptotic Analysis of Generative Semi-Supervised Learning", arxiv:1003.0024
• Weinan E, Chao Ma, Lei Wu, "The Generalization Error of the Minimum-norm Solutions for Over-parameterized Neural Networks", Pure and Applied Functional Analysis 5 (2020): 1145--1460, arxiv:1912.06987
• P. P. B. Eggermont, V. N. LaRiccia, "Uniform error bounds for smoothing splines", arxiv:math/0612776
• Vitaly Feldman, "Robustness of Evolvability" [PDF preprint]
• Majid Fozunbal, Ton Kalker, "Decision Making with Side Information and Unbounded Loss Functions", arxiv:cs/0601115
• Yoav Freund, Yishay Mansour and Robert E. Schapire, "Generalization bounds for averaged classifiers", Annals of Statistics 32 (2004): 1698--1722, math.ST/0410092
• Elisabeth Gassiat, Ramon van Handel, "The local geometry of finite mixtures", arxiv:1202.3482
• Servane Gey, "Vapnik-Chervonenkis Dimension of Axis-Parallel Cuts", arxiv:1203.0193
• Carl Gold and Peter Sollich, "Model Selection for Support Vector Machine Classification," cond-mat/0203334
• Lee-Ad Gottlieb, Leonid Kontorovich, Elchanan Mossel, "VC bounds on the cardinality of nearly orthogonal function classes", arxiv:1007.4915
• Adityanand Guntuboyina, Bodhisattva Sen, "Covering Numbers for Convex Functions", IEEE Transactions on Information Theory 59 (2013): 1957--1965 arxiv:1304.0147
• Steve Hanneke
• Nancy Heckman, "The theory and application of penalized methods or Reproducing Kernel Hilbert Spaces made easy", arxiv:1111.1915
• Daniel J. L. Herrmann and Dominik Janzing, "Selection Criterion for Log-Linear Models Using Statistical Learning Theory", math.ST/0302079
• Don Hush, Clint Scovel and Ingo Steinwart, "Stability of Unstable Learning Algorithms", Machine Learning 67 (2007): 197--206
• Sanjay Jain, Daniel Osherson, James S. Royer and Arun Sharma, Systems That Learn: An Introduction to Learning Theory
• Sham M. Kakade, Shai Shalev-Shwartz, Ambuj Tewari, "Regularization Techniques for Learning with Matrices", Journal of Machine Learning Research 13 (2012): 1865--1890
• Yuri Kalnishkan, Volodya Vovk and Michael V. Vyugin, "Loss functions, complexities, and the Legendre transformation", Theoretical Computer Science 313 (2004): 195--207
• Gerard Kerkyacharian and Dominique Picard, "Thresholding in Learning Theory", math.ST/0510271
• Jussi Klemela and Enno Mammen, "Empirical risk minimization in inverse problems", Annals of Statistics 38 (2010): 482--511
• Vladimir Koltchinskii, Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems
• Natalia Komarova and Igor Rivin, "Mathematics of learning," math.PR/0105235
• Samuel Kutin and Partha Niyogi, "Almost-everywhere algorithmic stability and generalization error", UAI 2002, arxiv:1301.0579
• Pirkko Kuusela, Daniel Ocone and Eduardo D. Sontag, "Learning Complexity Dimensions for a Continuous-Time Control System," math.OC/0012163
• John Langford, "Tutorial on Practical Prediction Theory for Classification", Journal of Machine Learning Research 6 (2005): 273--306 [For the PAC-Bayesian result]
• A. Lecchini-Visintini, J. Lygeros, J. Maciejowski, "Simulated Annealing: Rigorous finite-time guarantees for optimization on continuous domains", arxiv:0709.2989
• Guillaume Lecué
• Guillaume Lecué and Shahar Mendelson
• Feng Liang, Sayan Mukherjee, Mike West, "The Use of Unlabeled Data in Predictive Modeling", arxiv:0710.4618 = Statistical Science 22 (2007): 189--205
• Nick Littlestone and Manfred K. Warmuth, "Relating Data Compression and Learnability", 1986 preprint?
• Djamal Louani, Sidi Mohamed Ould Maouloud, "Large Deviation Results for the Nonparametric Regression Function Estimator on Functional Data", arxiv:1111.5989
• Gabor Lugosi and Marten Wegkamp, "Complexity regularization via localized random penalties", Annals of Statistics 32 (2004): 1679--1697 = math.ST/0410091
• Malik Magdon-Ismail, "Permutation Complexity Bound on Out-Sample Error", NIPS 2010 [PDF reprint]
• Dörthe Malzahn and Manfred Opper, "A statistical physics approach for the analysis of machine learning algorithms on real data", Journal of Statistical Mechanics: Theory and Experiment (2005): P11001
• Pascal Massart and Élodie Nédélec, "Risk bounds for statistical learning", math.ST/0702683 = Annals of Statistics 34 (2006): 2326--2366
• Andreas Maurer, "A Note on the PAC Bayesian Theorem", cs.LG/0411099
• Andreas Maurer, Massimiliano Pontil, "Structured Sparsity and Generalization", Journal of Machine Learning Research 13 (2012): 671--690
• Shahar Mendelson
• "Improving the Sample Complexity Using Global Data", IEEE Transactions on Information Theory 48 (2002): 1977--1991
• "Lower Bounds for the Empirical Minimization Algorithm", IEEE Transactions on Information Theory 54 (2008): 3797--3803 ["simple argument ... that under mild geometric assumptions .... the empirical minimization algorithm cannot yield a uniform error rate that is faster than $1/\sqrt{k}$ in the function learning setup".]
• Shahar Mendelson and Gideon Schechtman, "The shattering dimension of sets of linear functionals", Annals of Probability 32 (2004): 1746--1770 = math.PR/0410096
• P. Mitra, C. A. Murthy and S. K. Pal, "A probabilistic active support vector learning algorithm", IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (2004): 413--418
• Sayan Mukherjee, Partha Niyogi, Tomaso Poggio and Ryan Rifkin, "Learning theory: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization", Advances in Computational Mathematics 25 (2006): 161--193 [PDF. Thanks to Shiva Kaul for pointing this out to me.]
• Boaz Nadler, "Finite sample approximation results for principal component analysis: a matrix perturbation approach", Annals of Statistics 36 (2008): 2791--2817, arxiv:0901.3245
• Roland Nilsson, Jose M. Pena, Johan Bjorkegren and Jesper Tegner, "Consistent Feature Selection for Pattern Recognition in Polynomial Time", Journal of Machine Learning Research 8 (2007): 589--612
• E. Parrado-Hernandez, I. Mora-Jimenez, J. Arenas-García, A. R. Figueiras-Vidal and A. Navia-Vazquez, "Growing support vector classifiers with controlled complexity," Pattern Recognition Letters 36 (2003): 1479--1488
• "PAC learnability of a concept class under non-atomic measures: a problem by Vidyasagar", arxiv:1006.5090
• "PAC learnability versus VC dimension: a footnote to a basic result of statistical learning", arxiv:1104.2097
• Maxim Raginsky, "Achievability results for statistical learning under communication constraints", arxiv:0901.1905
• Alexander Rakhlin and Karthik Sridharan, "Statistical Learning Theory and Sequential Prediction" [PDF lecture notes]
• Liva Ralaivola, Marie Szafranski, Guillaume Stempfel, "Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary β-Mixing Processes", arxiv:0909.1933
• Sebastian Risau-Gusman and Mirta B. Gordon
• "Generalization Properties of Finite Size Polynomial Support Vector Machines," cond-mat/0002071 =? "Hierarchical Learning in Polynomial Support Vector Machines," Machine Learning 46 (2002): 53--70
• "Learning curves for Soft Margin Classifiers," cond-mat/0203315
• Igor Rivin, "The performance of the batch learner algorithm," cs.LG/0201009
• Dana Ron, "Property Testing: A Learning Theory Perspective", Foundations and Trends in Machine Learning 1 (2008): 307--402 [PDF preprint]
• Benjamin I. P. Rubinstein, Aleksandr Simma, "On the Stability of Empirical Risk Minimization in the Presence of Multiple Risk Minimizers", IEEE Transactions on Information Theory 58 (2012): 4160--4163, arxiv:1002.2044
• J. Hyam Rubinstein, Benjamin I. P. Rubinstein, Peter L. Bartlett, "Bounding Embeddings of VC Classes into Maximum Classes", arxiv:1401.7388
• Lorenzo Rosasco, Ernesto De Vito, Andrea Caponnetto, Michele Piana and Alessandro Verri, "Are Loss Functions All the Same?", Neural Computation 16 (2004): 1063--1076 [Ans.: pretty much, apparently, at least if they're reasonably convex]
• Cynthia Rudin, "Stability Analysis for Regularized Least Squares Regression", cs.LG/0502016
• Sivan Sabato, Nathan Srebro, Naftali Tishby, "Distribution-Dependent Sample Complexity of Large Margin Learning", Journal of Machine Learning Research 14 (2013): 2119--2149
• Narayana Santhanam, Venkatachalam Anantharam, Wojciech Szpankowski, "Data-Derived Weak Universal Consistency", Journal of Machine Learning Research 23 (2022): 27
• Tom Schaul, Sixin Zhang, Yann LeCun, "No More Pesky Learning Rates", arxiv:1206.1106
• Yevgeny Seldin, Nicolò Cesa-Bianchi, François Laviolette, Peter Auer, John Shawe-Taylor, Jan Peters, "PAC-Bayesian Analysis of the Exploration-Exploitation Trade-off", arxiv:1105.4585
• Xuhui Shao, Vladimir Cherkassky and William Li, "Measuring the VC-Dimension Using Optimized Experimental Design," Neural Computation 12 (2000): 1969--1986
• Xiaotong Shen and Lifeng Wang, "Generalization error for multi-class margin classification", Electronic Journal of Statistics 1 (2007): 307--330, arxiv:0708.3556
• I. Steinwart, "Consistency of Support Vector Machines and Other Regularized Kernel Classifiers", IEEE Transactions on Information Theory 51 (2005): 128--142
• Ingo Steinwart and Andreas Christmann, Support Vector Machines
• Joe Suzuki, "On Strong Consistency of Model Selection in Classification", IEEE Transactions on Information Theory 52 (2006): 4767--4774 [Based on information-theoretic criteria]
• Sara A. van de Geer
• "On non-asymptotic bounds for estimation in generalized linear models with highly correlated design", arxiv:0709.0844
• "High-dimensional generalized linear models and the lasso", Annals of Statistics 36 (2008): 614--645 = arxiv:0804.0703
• Aad van der Vaart and Jon A. Wellner, "A note on bounds for VC dimensions", pp. 103--107 in Christian Houdré, Vladimir Koltchinskii, David M. Mason and Magda Peligrad (eds.), High Dimensional Probability V: The Luminy Volume ["We provide bounds for the VC dimension of class of sets formed by unions, intersections, and products of VC classes of sets". Open access]
• V. Vapnik and O. Chapelle, "Bounds on Error Expectation for Support Vector Machines," Neural Computation 12 (2000): 2013--2036
• Silvia Villa, Lorenzo Rosasco, Tomaso Poggio, "On Learnability, Complexity and Stability", arxiv:1303.5976
• Ulrike von Luxburg, Bernhard Schoelkopf, "Statistical Learning Theory: Models, Concepts, and Results", arxiv:0810.4752
• Yu-Xiang Wang, Huan Xu, "Stability of matrix factorization for collaborative filtering", ICML 2012, arxiv:1206.4640
• Huan Xu and Shie Mannor, "Robustness and Generalization", Machine Learning 86 (2012): 391--423 arxiv:1005.2243
• Yuhong Yang and Andrew Barron, "Information-theoretic determination of minimax rates of convergence", Annals of Statistics 27 (1999): 1564--1599
• Kai Yu and Tong Zhang, "High Dimensional Nonlinear Learning using Local Coordinate Coding", arxiv:0906.5190
• Alon Zakai and Ya'acov Ritov, "Consistency and Localizability", Journal of Machine Learning Research 10 (2009): 827--856
• Chao Zhang and Dacheng Tao, "Generalization Bound for Infinitely Divisible Empirical Process", Journal of Machine Learning Research Workshops and Conference Proceedings 15 (2011): 864--872
• Tong Zhang
• Ying Zhu, "Phase transitions in nonparametric regressions: a curse of exploiting higher degree smoothness assumptions in finite samples", arxiv:2112.03626
• Or Zuk, Shiri Margel, Eytan Domany, "On the Number of Samples Needed to Learn the Correct Structure of a Bayesian Network", UAI 2006, arxiv:1206.6862