Optimization
Last update: 05 Sep 2025 09:31; First version: Before 14 September 2013
A more than usually inadequate placeholder.
One important question: why does gradient descent work so well in machine learning, especially for neural networks?
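As a point of reference for that question, here is a minimal sketch of plain gradient descent on a convex least-squares problem, the easy case where convergence is well understood; the function names, data, and step size are illustrative choices of mine, not drawn from any of the references below. The puzzle is why essentially the same recipe works so well on the non-convex losses of neural networks.

```python
import numpy as np

def gradient_descent(grad, x0, step=0.1, iters=2000):
    """Repeatedly step against the gradient of the objective."""
    x = x0
    for _ in range(iters):
        x = x - step * grad(x)
    return x

# Toy convex example: least-squares regression, loss (1/2n)||Ax - b||^2,
# where a small enough constant step size provably converges.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 3))
b = A @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=50)
loss_grad = lambda x: A.T @ (A @ x - b) / len(b)

x_hat = gradient_descent(loss_grad, np.zeros(3))
print(x_hat)  # should recover roughly (1, -2, 0.5)
```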
- See also:
	
 - Calculus of Variations and Optimal Control Theory
 - Computational Complexity
 - Control Theory
 - Decision Theory
 - Economics
 - Evolutionary Computation
 - Learning in Games
 - Learning Theory
 - Low-Regret Learning
 - Math I Ought to Learn
 - Planned Economies
 - Stochastic Approximation
 
- Recommended, big picture:
	
 - Aharon Ben-Tal and Arkadi Nemirovski, Lectures on Modern Convex Optimization [PDF via Prof. Nemirovski]
 - Cristopher Moore and Stephan Mertens, The Nature of Computation [Cris and Stephan were kind enough to let me read this in manuscript; it's magnificent. Review: Intellects Vast and Warm and Sympathetic]
 
- Recommended, close-ups:
	
 - Alekh Agarwal, Peter L. Bartlett, Pradeep Ravikumar, Martin J. Wainwright, "Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization", arxiv:1009.0571
 - Aureli Alabert, Alessandro Berti, Ricard Caballero, Marco Ferrante, "No-Free-Lunch Theorems in the continuum", arxiv:1409.2175
 - Léon Bottou and Olivier Bousquet, "The Tradeoffs of Large Scale Learning", in Sra et al. (eds.), below [PDF reprint via Dr. Bottou]
 - Venkat Chandrasekaran and Michael I. Jordan, "Computational and Statistical Tradeoffs via Convex Relaxation", Proceedings of the National Academy of Sciences (USA) 110 (2013): E1181--E1190, arxiv:1211.1073
 - Paul Mineiro, Nikos Karampatziakis, "Loss-Proportional Subsampling for Subsequent ERM", arxiv:1306.1840
 - Maxim Raginsky, Alexander Rakhlin, "Information-based complexity, feedback and dynamics in convex programming", arxiv:1010.2285
 - Suvrit Sra, Sebastian Nowozin and Stephen J. Wright (eds.), Optimization for Machine Learning
 - Ben Recht, Convex Optimization Live Blog (2024)
 - Margaret H. Wright, "The interior-point revolution in optimization: History, recent developments, and lasting consequences", Bulletin of the American Mathematical Society (New Series) 42 (2005): 39--56
 - David H. Wolpert and William G. Macready, "No free lunch theorems for optimization", IEEE Transactions on Evolutionary Computation 1 (1997): 67--82, [SFI Working Paper]
 
- Recommended, historical interest:
	
 - Richard Bellman (ed.), Mathematical Optimization Techniques [University of California Press, 1963]
 
- To read, history and historical interest:
	
 - George B. Dantzig, Linear Programming and Extensions
 - Giorgio Giorgi and Tinne Hoff Kjeldsen (eds.), Traces and Emergence of Nonlinear Programming
 - I. Gumowski and C. Mira, Optimization in Control Theory and Practice
 - William Thomas, Rational Action: The Sciences of Policy in Britain and America, 1940--1960
 
- To read, textbooks/big picture:
	
 - Dimitri P. Bertsekas, Constrained Optimization and Lagrange Multiplier Methods [PDF reprint via Prof. Bertsekas]
 - Stephen Boyd and Lieven Vandenberghe, Convex Optimization [Yes, I should have finished this decades ago; no, I haven't]
 - Jan Brinkhuis and Vladimir Tikhomirov, Optimization: Insights and Applications
 - Giuseppe C. Calafiore and Laurent El Ghaoui, Optimization Models
 - Paulo Cortez, Modern Optimization with R
 - B. Guenin, J. Könemann and L. Tunçel, A Gentle Introduction to Optimization
 - Andrzej Ruszczynski, Nonlinear Optimization
 - James C. Spall, Introduction to Stochastic Search and Optimization
 - Rangarajan K. Sundaram, A First Course in Optimization Theory
 - Nisheeth K. Vishnoi, Algorithms for Convex Optimization
 - Stephen J. Wright and Benjamin Recht, Optimization for Data Analysis
 
- To read, close-ups:
	
 - Dennis Amelunxen, Martin Lotz, Michael B. McCoy, Joel A. Tropp, "Living on the edge: Phase transitions in convex programs with random data", arxiv:1303.6672
 - Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W. Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, Nando de Freitas, "Learning to learn by gradient descent by gradient descent", arxiv:1606.04474
 - Giorgio Ausiello, Alberto Marchetti-Spaccamela, Pierluigi Crescenzi, Giorgio Gambosi, Marco Protasi, Viggo Kann, Complexity and Approximation: Combinatorial Optimization Problems and Their Approximability Properties
 - Francis Bach, Rodolphe Jenatton, Julien Mairal, Guillaume Obozinski, "Optimization with Sparsity-Inducing Penalties", Foundations and Trends in Machine Learning 4 (2011): 1--106, arxiv:1108.0775
 - Gerard Ben Arous, Reza Gheissari, Aukosh Jagannath, "Online stochastic gradient descent on non-convex losses from high-dimensional inference", arxiv:2003.10409
 - Aharon Ben-Tal, Laurent El Ghaoui and Arkadi Nemirovski, Robust Optimization
 - Yatao Bian, Xiong Li, Yuncai Liu, Ming-Hsuan Yang, "Parallel Coordinate Descent Newton Method for Efficient $\ell_1$-Regularized Minimization", arxiv:1306.4080
 - Sébastien Bubeck
    - Introduction to Online Optimization
    - "Convex Optimization: Algorithms and Complexity", Foundations and Trends in Machine Learning 8 (2015): 231--357, arxiv:1405.4980
 - Rocco Caprio, Juan Kuntz, Samuel Power, Adam M. Johansen, "Error bounds for particle gradient descent, and extensions of the log-Sobolev and Talagrand inequalities", arxiv:2403.02004
 - C. Cartis, N. I. M. Gould and Ph. L. Toint, "On the complexity of steepest descent, Newton’s and regularized Newton’s methods for nonconvex unconstrained optimization", SIAM Journal on Optimization 20 (2010): 2833--2852
 - Miguel Á. Carreira-Perpiñán and Weiran Wang, "Distributed optimization of deeply nested systems", arxiv:1212.5921 [Remarkable if true]
 - Sourav Chatterjee, "Convergence of gradient descent for deep neural networks", arxiv:2203.16462
 - Guilherme França, Daniel P. Robinson, and René Vidal, "Gradient flows and proximal splitting methods: A unified view on accelerated and stochastic optimization", Physical Review E 103 (2021): 053304
 - Moritz Hardt, Tengyu Ma, Benjamin Recht, "Gradient Descent Learns Linear Dynamical Systems", Journal of Machine Learning Research 19 (2018): 29
 - Elad Hazan, Satyen Kale, "Projection-free Online Learning", arxiv:1206.4657
 - Tamir Hazan, George Papandreou, and Daniel Tarlow (eds.), Perturbations, Optimization, and Statistics
 - Philipp Hennig and Martin Kiefel, "Quasi-Newton Methods: A New Direction", arxiv:1206.4602
 - Prateek Jain, Purushottam Kar, "Non-convex Optimization for Machine Learning", Foundations and Trends in Machine Learning 10 (2017): 142--336, arxiv:1712.07897
 - Adel Javanmard, Andrea Montanari, and Federico Ricci-Tersenghi, "Phase transitions in semidefinite relaxations", Proceedings of the National Academy of Sciences 113 (2016): E2218--E2223
 - Chi Jin, Praneeth Netrapalli, Rong Ge, Sham M. Kakade and Michael I. Jordan, "On Nonconvex Optimization for Machine Learning: Gradients, Stochasticity, and Saddle Points", Journal of the ACM 68 (2021): 11
 - Michael I. Jordan, Guy Kornowski, Tianyi Lin, Ohad Shamir, Manolis Zampetakis, "Deterministic Nonsmooth Nonconvex Optimization", arxiv:2302.08300
 - Soeren Laue, "A Hybrid Algorithm for Convex Semidefinite Optimization", arxiv:1206.4608
 - Adrian S. Lewis, Tonghua Tian, "The complexity of first-order optimization methods from a metric perspective", arxiv:2305.03208
 - Mehrdad Mahdavi, Rong Jin, "MixedGrad: An O(1/T) Convergence Rate Algorithm for Stochastic Smooth Optimization", arxiv:1307.7192
 - Mehrdad Mahdavi, Rong Jin, Tianbao Yang, "Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints", arxiv:1111.6082
 - Luke Metz, C. Daniel Freeman, Niru Maheswaranathan, Jascha Sohl-Dickstein, "Training Learned Optimizers with Randomly Initialized Learned Optimizers", arxiv:2101.07367
 - Peter Orbanz, "Global optimality under amenable symmetry constraints", arxiv:2402.07613
 - N. Parikh and S. Boyd, "Proximal Algorithms", Foundations and Trends in Optimization 1 (2014): 123--231 [PDF reprint and other resources via Prof. Boyd]
 - Courtney Paquette, Elliot Paquette, Ben Adlam, Jeffrey Pennington, "Implicit Regularization or Implicit Conditioning? Exact Risk Trajectories of SGD in High Dimensions", arxiv:2206.07252
 - Luc Pronzato, Henry P. Wynn and Anatoly A. Zhigljavsky
    - Dynamical Search: Applications of Dynamical Systems in Search and Optimization
    - "An Introduction to Dynamical Search", pp. 115--150 in Panos M. Pardalos and H. Edwin Romeijn (eds.), Handbook of Global Optimization, volume 2
 - Debasish Roy and G. Visweswara Rao, Stochastic Dynamics, Filtering and Optimization
 - Levent Sagun, V. Ugur Guney, Gerard Ben Arous, Yann LeCun, "Explorations on high dimensional landscapes", arxiv:1412.6615
 - Ohad Shamir, Tong Zhang, "Stochastic Gradient Descent for Non-smooth Optimization: Convergence Results and Optimal Averaging Schemes", arxiv:1212.1824
 - Rachael Tappenden, Peter Richtárik, Jacek Gondzio, "Inexact Coordinate Descent: Complexity and Preconditioning", arxiv:1304.5530
 - Matus Telgarsky, "Stochastic linear optimization never overfits with quadratically-bounded losses on general data", arxiv:2202.06915
 - Belinda Tzen, Anant Raj, Maxim Raginsky, Francis Bach, "Variational Principles for Mirror Descent and Mirror Langevin Dynamics", arxiv:2303.09532
 - Matthew D. Zeiler, "ADADELTA: An Adaptive Learning Rate Method", arxiv:1212.5701
 - Lijun Zhang, Tianbao Yang, Rong Jin, Xiaofei He, "O(logT) Projections for Stochastic Optimization of Smooth and Strongly Convex Functions", arxiv:1304.0740
 - Hui Zhang, Wotao Yin, "Gradient methods for convex minimization: better rates under weaker conditions", arxiv:1303.4645