Optimization
Last update: 05 Sep 2025 09:31; First version: Before 14 September 2013
A more than usually inadequate placeholder.
One important question: why does gradient descent work so well in machine learning, especially for neural networks?
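As a point of reference for that question, here is a minimal sketch of plain gradient descent on a convex least-squares problem, the easy case where convergence is well understood; the function names, data, and step size are illustrative choices of mine, not drawn from any of the references below. The puzzle is why essentially the same recipe works so well on the non-convex losses of neural networks.

```python
import numpy as np

def gradient_descent(grad, x0, step=0.1, iters=2000):
    """Repeatedly step against the gradient of the objective."""
    x = x0
    for _ in range(iters):
        x = x - step * grad(x)
    return x

# Toy convex example: least-squares regression, loss (1/2n)||Ax - b||^2,
# where a small enough constant step size provably converges.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 3))
b = A @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=50)
loss_grad = lambda x: A.T @ (A @ x - b) / len(b)

x_hat = gradient_descent(loss_grad, np.zeros(3))
print(x_hat)  # should recover roughly (1, -2, 0.5)
```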
- See also:
	
 - Calculus of Variations and Optimal Control Theory
 - Computational Complexity
 - Control Theory
 - Decision Theory
 - Economics
 - Evolutionary Computation
 - Learning in Games
 - Learning Theory
 - Low-Regret Learning
 - Math I Ought to Learn
 - Planned Economies
 - Stochastic Approximation
 
- Recommended, big picture:
	
 - Aharon Ben-Tal and Arkadi Nemirovski, Lectures on Modern Convex Optimization [PDF via Prof. Nemirovski]
 - Cristopher Moore and Stephan Mertens, The Nature of Computation [Cris and Stephan were kind enough to let me read this in manuscript; it's magnificent. Review: Intellects Vast and Warm and Sympathetic]
 
- Recommended, close-ups:
	
 - Alekh Agarwal, Peter L. Bartlett, Pradeep Ravikumar, Martin J. Wainwright, "Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization", arxiv:1009.0571
 - Aureli Alabert, Alessandro Berti, Ricard Caballero, Marco Ferrante, "No-Free-Lunch Theorems in the continuum", arxiv:1409.2175
 - Léon Bottou and Olivier Bousquet, "The Tradeoffs of Large Scale Learning", in Sra et al. (eds.), below [PDF reprint via Dr. Bottou]
 - Venkat Chandrasekaran and Michael I. Jordan, "Computational and Statistical Tradeoffs via Convex Relaxation", Proceedings of the National Academy of Sciences (USA) 110 (2013): E1181--E1190, arxiv:1211.1073
 - Paul Mineiro, Nikos Karampatziakis, "Loss-Proportional Subsampling for Subsequent ERM", arxiv:1306.1840
 - Maxim Raginsky, Alexander Rakhlin, "Information-based complexity, feedback and dynamics in convex programming", arxiv:1010.2285
 - Suvrit Sra, Sebastian Nowozin and Stephen J. Wright (eds.), Optimization for Machine Learning
 - Ben Recht, Convex Optimization Live Blog (2024)
 - Margaret H. Wright, "The interior-point revolution in optimization: History, recent developments, and lasting consequences", Bulletin of the American Mathematical Society (New Series) 42 (2005): 39--56
 - David H. Wolpert and William G. Macready, "No free lunch theorems for optimization", IEEE Transactions on Evolutionary Computation 1 (1997): 67--82, [SFI Working Paper]
 
- Recommended, historical interest:
	
 - Richard Bellman (ed.), Mathematical Optimization Techniques [University of California Press, 1963]
 
- To read, history and historical interest:
	
 - George B. Dantzig, Linear Programming and Extensions
 - Giorgio Giorgi and Tinne Hoff Kjeldsen (eds.), Traces and Emergence of Nonlinear Programming
 - I. Gumowski and C. Mira, Optimization in Control Theory and Practice
 - William Thomas, Rational Action: The Sciences of Policy in Britain and America, 1940--1960
 
- To read, textbooks/big picture:
	
 - Dimitri P. Bertsekas, Constrained Optimization and Lagrange Multiplier Methods [PDF reprint via Prof. Bertsekas]
 - Stephen Boyd and Lieven Vandenberghe, Convex Optimization [Yes, I should have finished this decades ago; no, I haven't]
 - Jan Brinkhuis and Vladimir Tikhomirov, Optimization: Insights and Applications
 - Giuseppe C. Calafiore and Laurent El Ghaoui, Optimization Models
 - Paulo Cortez, Modern Optimization with R
 - B. Guenin, J. Könemann and L. Tunçel, A Gentle Introduction to Optimization
 - Andrzej Ruszczynski, Nonlinear Optimization
 - James C. Spall, Introduction to Stochastic Search and Optimization
 - Rangarajan K. Sundaram, A First Course in Optimization Theory
 - Nisheeth K. Vishnoi, Algorithms for Convex Optimization
 - Stephen J. Wright and Benjamin Recht, Optimization for Data Analysis
 
- To read, close-ups:
	
 - Dennis Amelunxen, Martin Lotz, Michael B. McCoy, Joel A. Tropp, "Living on the edge: Phase transitions in convex programs with random data", arxiv:1303.6672
 - Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W. Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, Nando de Freitas, "Learning to learn by gradient descent by gradient descent", arxiv:1606.04474
 - Giorgio Ausiello, Alberto Marchetti-Spaccamela, Pierluigi Crescenzi, Giorgio Gambosi, Marco Protasi, Viggo Kann, Complexity and Approximation: Combinatorial Optimization Problems and Their Approximability Properties
 - Francis Bach, Rodolphe Jenatton, Julien Mairal, Guillaume Obozinski, "Optimization with Sparsity-Inducing Penalties", Foundations and Trends in Machine Learning 4 (2011): 1--106, arxiv:1108.0775
 - Gerard Ben Arous, Reza Gheissari, Aukosh Jagannath, "Online stochastic gradient descent on non-convex losses from high-dimensional inference", arxiv:2003.10409
 - Aharon Ben-Tal, Laurent El Ghaoui and Arkadi Nemirovski, Robust Optimization
 - Yatao Bian, Xiong Li, Yuncai Liu, Ming-Hsuan Yang, "Parallel Coordinate Descent Newton Method for Efficient $\ell_1$-Regularized Minimization", arxiv:1306.4080
 - Sébastien Bubeck
    - Introduction to Online Optimization
    - "Convex Optimization: Algorithms and Complexity", Foundations and Trends in Machine Learning 8 (2015): 231--357, arxiv:1405.4980
 - Rocco Caprio, Juan Kuntz, Samuel Power, Adam M. Johansen, "Error bounds for particle gradient descent, and extensions of the log-Sobolev and Talagrand inequalities", arxiv:2403.02004
 - C. Cartis, N. I. M. Gould and Ph. L. Toint, "On the complexity of steepest descent, Newton’s and regularized Newton’s methods for nonconvex unconstrained optimization", SIAM Journal on Optimization 20 (2010): 2833--2852
 - Miguel Á. Carreira-Perpiñán and Weiran Wang, "Distributed optimization of deeply nested systems", arxiv:1212.5921 [Remarkable if true]
 - Sourav Chatterjee, "Convergence of gradient descent for deep neural networks", arxiv:2203.16462
 - Guilherme França, Daniel P. Robinson, and René Vidal, "Gradient flows and proximal splitting methods: A unified view on accelerated and stochastic optimization", Physical Review E 103 (2021): 053304
 - Moritz Hardt, Tengyu Ma, Benjamin Recht, "Gradient Descent Learns Linear Dynamical Systems", Journal of Machine Learning Research 19 (2018): 29
 - Elad Hazan, Satyen Kale, "Projection-free Online Learning", arxiv:1206.4657
 - Tamir Hazan, George Papandreou, and Daniel Tarlow (eds.), Perturbations, Optimization, and Statistics
 - Philipp Hennig and Martin Kiefel, "Quasi-Newton Methods: A New Direction", arxiv:1206.4602
 - Prateek Jain, Purushottam Kar, "Non-convex Optimization for Machine Learning", Foundations and Trends in Machine Learning 10 (2017): 142--336, arxiv:1712.07897
 - Adel Javanmard, Andrea Montanari, and Federico Ricci-Tersenghi, "Phase transitions in semidefinite relaxations", Proceedings of the National Academy of Sciences 113 (2016): E2218--E2223
 - Chi Jin, Praneeth Netrapalli, Rong Ge, Sham M. Kakade and Michael I. Jordan, "On Nonconvex Optimization for Machine Learning: Gradients, Stochasticity, and Saddle Points", Journal of the ACM 68 (2021): 11
 - Michael I. Jordan, Guy Kornowski, Tianyi Lin, Ohad Shamir, Manolis Zampetakis, "Deterministic Nonsmooth Nonconvex Optimization", arxiv:2302.08300
 - Soeren Laue, "A Hybrid Algorithm for Convex Semidefinite Optimization", arxiv:1206.4608
 - Adrian S. Lewis, Tonghua Tian, "The complexity of first-order optimization methods from a metric perspective", arxiv:2305.03208
 - Mehrdad Mahdavi, Rong Jin, "MixedGrad: An O(1/T) Convergence Rate Algorithm for Stochastic Smooth Optimization", arxiv:1307.7192
 - Mehrdad Mahdavi, Rong Jin, Tianbao Yang, "Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints", arxiv:1111.6082
 - Luke Metz, C. Daniel Freeman, Niru Maheswaranathan, Jascha Sohl-Dickstein, "Training Learned Optimizers with Randomly Initialized Learned Optimizers", arxiv:2101.07367
 - Peter Orbanz, "Global optimality under amenable symmetry constraints", arxiv:2402.07613
 - N. Parikh and S. Boyd, "Proximal Algorithms", Foundations and Trends in Optimization 1 (2014): 123--231 [PDF reprint and other resources via Prof. Boyd]
 - Courtney Paquette, Elliot Paquette, Ben Adlam, Jeffrey Pennington, "Implicit Regularization or Implicit Conditioning? Exact Risk Trajectories of SGD in High Dimensions", arxiv:2206.07252
 - Luc Pronzato, Henry P. Wynn and Anatoly A. Zhigljavsky
    - Dynamical Search: Applications of Dynamical Systems in Search and Optimization
    - "An Introduction to Dynamical Search", pp. 115--150 in Panos M. Pardalos and H. Edwin Romeijn (eds.), Handbook of Global Optimization, volume 2
 - Debasish Roy and G. Visweswara Rao, Stochastic Dynamics, Filtering and Optimization
 - Levent Sagun, V. Ugur Guney, Gerard Ben Arous, Yann LeCun, "Explorations on high dimensional landscapes", arxiv:1412.6615
 - Ohad Shamir, Tong Zhang, "Stochastic Gradient Descent for Non-smooth Optimization: Convergence Results and Optimal Averaging Schemes", arxiv:1212.1824
 - Rachael Tappenden, Peter Richtárik, Jacek Gondzio, "Inexact Coordinate Descent: Complexity and Preconditioning", arxiv:1304.5530
 - Matus Telgarsky, "Stochastic linear optimization never overfits with quadratically-bounded losses on general data", arxiv:2202.06915
 - Belinda Tzen, Anant Raj, Maxim Raginsky, Francis Bach, "Variational Principles for Mirror Descent and Mirror Langevin Dynamics", arxiv:2303.09532
 - Matthew D. Zeiler, "ADADELTA: An Adaptive Learning Rate Method", arxiv:1212.5701
 - Lijun Zhang, Tianbao Yang, Rong Jin, Xiaofei He, "O(logT) Projections for Stochastic Optimization of Smooth and Strongly Convex Functions", arxiv:1304.0740
 - Hui Zhang, Wotao Yin, "Gradient methods for convex minimization: better rates under weaker conditions", arxiv:1303.4645