Cosma's Notebooks (http://bactra.org/notebooks)
Neural Nets, Connectionism, Perceptrons, etc.
http://bactra.org/notebooks/2024/02/25#neural-nets
<h4>Old notes from c. 2000</h4>
I'm mostly interested in them as a means of <a
href="learning-inference-induction.html">machine learning or statistical
inference</a>. I am particularly interested in their role as models of <a
href="chaos.html">dynamical systems</a> (via recurrent nets, generally), and as
models of <a href="transducers.html">transduction</a>.
<P>I need to understand better how the analogy to spin glasses works, but then,
I need to understand spin glasses better too.
<P>The arguments that connectionist models are superior, for purposes
of <a href="cognitive-science.html">cognitive science</a>, to more "symbolic"
ones I find unconvincing. (Saying that they're more biologically realistic is
like saying that cars are better models of animal locomotion than bicycles,
because cars have four appendages in contact with the ground and not two.)
This is not to say, of course, that some connectionist models of cognition
aren't interesting, insightful and valid; but the same is true of many symbolic
models, and there seems no compelling reason for abandoning the latter in favor
of the former. (For more on this point, see Gary Marcus.) --- <em>Of
course</em> a cognitive model which cannot be implemented in real brains must
be rejected; connecting neurobiology to cognition can hardly be too ardently
desired. The point is that the elements in connectionist models called
"neurons" bear only the sketchiest resemblance to the real thing, and neural
nets are no more than caricatures of real neuronal circuits. Sometimes sketchy
resemblances and caricatures are enough to help us learn, which is why Hebb,
McCulloch and <cite>Neural Computation</cite> are important for both
connectionism and neurobiology.
<h4>Reflections circa 2016</h4>
<P>I first learned about neural networks as an undergraduate in the early
1990s, when, judging by the press, Geoff Hinton and his students were going to
take over the world. (In "Introduction to Cognitive Science" at Berkeley, we
trained a three-layer perceptron to classify fictional characters as "Sharks" or "Jets"
using back-propagation; I had no idea what those labels meant because I'd never
seen <cite>West Side Story</cite>.) I then lived through neural nets virtually
disappearing from the proceedings of Neural Information Processing Systems, and
felt myself very retro for including neural nets the first time I taught data
mining in 2006. (I dropped them by 2009.) The recent revival, as "deep
learning", is a bit weird for me, especially since none of the public rhetoric
has changed. The most interesting thing <em>scientifically</em> about the new
wave is that it's led to the discovery
of <a href="adversarial-examples.html">adversarial examples</a>, which I think
we still don't understand very well at all. The most interesting thing
meta-scientifically is how much the new wave of excitement about neural
networks seems to be accompanied by <em>forgetting</em> earlier results,
techniques, and baselines.
<h4>Reflections in early 2022</h4>
<P>I would now actually say there are <em>three</em> scientifically interesting
phenomena revealed by the current wave of interest in neural networks:
<ol>
<li> <a href="adversarial-examples.html">Adversarial examples</a> (as revealed by Szegedy et al.), and the converse phenomenon of extremely high confidence classification of nonsense images that have no humanly-perceptible resemblance to the class (e.g., Nguyen et al.);
<li> The ability to generalize to new instances by using humanly-irrelevant features like pixels at the edges of images (e.g., Carter et al.);
<li> The ability to generalize to new instances despite having the capacity to memorize random training data (e.g., Zhang et al.).
</ol>
<P>It's not at all clear how specific any of these are to <em>neural
networks</em>. (See Belkin's wonderful "Fit without Fear" for a status report
on our progress in understanding my item (3) using other models, going back all
the way to <a href="http://bactra.org/reviews/boosting.html">margin-based
understandings of boosting</a>.) It's also not clear how they inter-relate.
But they are all clearly extremely important phenomena in machine learning
which we do not yet understand, and really, really ought to understand.
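<P>A toy version of phenomenon (1) in the linear case. Everything below is invented for the demonstration (fixed weights standing in for a trained classifier); real adversarial examples were found in deep networks, but a high-dimensional linear classifier already shows how many individually tiny, coordinated input changes can add up to a huge change in the score:

```python
import math

# All numbers here are made up: a "trained" linear classifier with
# 10,000 features, and an input it classifies as positive by a small margin.
n = 10_000
w = [1.0 if i % 2 == 0 else -1.0 for i in range(n)]
x = [0.5] * n
x[0] = 0.9                       # breaks the tie: w . x = 0.4 > 0
score = sum(wi * xi for wi, xi in zip(w, x))

# Move every coordinate by only eps = 0.01, each in the worst-case direction.
# The score shifts by eps * sum(|w_i|) = eps * n = 100, swamping the margin.
eps = 0.01
x_adv = [xi - eps * math.copysign(1.0, wi) for wi, xi in zip(w, x)]
adv_score = sum(wi * xi for wi, xi in zip(w, x_adv))
```

No single coordinate moves perceptibly, yet the classification flips: in high dimension, "small" in the max norm can be enormous in the inner product.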
<P>I'd add that I <em>still</em> think there has been a remarkable regression
in understanding of our field's past, and of some hard-won lessons. When I
hear people <a href="nn-attention-and-transformers.html">conflating "attention" in neural networks with attention in
animals</a>, I start muttering
about <a href="https://doi.org/10.1145/1045339.1045340">"wishful mnemonics",
and "did Drew McDermott live and fight in vain?"</a> Similarly, when I hear
graduate students, and even young professors, explaining
that <a href="http://arxiv.org/abs/1301.3781">Mikolov et al. 2013</a> invented
the idea of representing words by embedding them in a vector space, with
proximity in the space tracking patterns of co-occurrence, as
though <a href="text-mining.html">latent semantic indexing</a> (for instance)
didn't date from the 1980s, I get kind of indignant. (Maybe the new embedding
methods are <em>better</em> for your particular application than Good Old
Fashioned Principal Components, or even than kernelized PCA, but <em>argue</em>
that, dammit.)
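<P>For concreteness, the Good Old Fashioned pipeline is only a few lines: count co-occurrences, damp, take a truncated SVD, and proximity in the resulting space tracks co-occurrence patterns. This is a toy sketch (tiny invented corpus, arbitrary window size and dimension), not a serious implementation:

```python
import numpy as np

# A tiny invented corpus; the point is the pipeline, not the data.
corpus = ("the cat sat on the mat . the dog sat on the rug . "
          "the cat chased the dog .").split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}

# Symmetric co-occurrence counts within a +/- 2-word window.
C = np.zeros((len(vocab), len(vocab)))
for i, w in enumerate(corpus):
    for j in range(max(0, i - 2), min(len(corpus), i + 3)):
        if j != i:
            C[idx[w], idx[corpus[j]]] += 1.0

# LSA-style embeddings: truncated SVD of the (log-damped) count matrix.
k = 3
U, s, Vt = np.linalg.svd(np.log1p(C))
emb = U[:, :k] * s[:k]

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

# Words with similar co-occurrence patterns end up near each other:
sim_cat_dog = cosine(emb[idx["cat"]], emb[idx["dog"]])
```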
<P>I am quite prepared to believe that <em>part</em> of my reaction here is
sour grapes, since deep learning swept all before it right around
the time I got tenure, and I am now too inflexible to really jump on the
bandwagon.
<P>That is my opinion; and it is further my opinion that you kids should
get off of my lawn.
<P><strong>25 July 2022</strong>: In the unlikely event you want to read pages
and pages of me on neural
networks, <a href="http://www.stat.cmu.edu/~cshalizi/dm/22/lectures/21/lecture-21.pdf">try
my lecture notes</a>. (That URL might change in the future.)
<P><strong>26 September 2022</strong>: Things I should learn more about (an incomplete list):
<ol>
<li> "Transformer" architectures, specifically looking at them as ways of doing
sequential probability estimation.
(<a href="nn-attention-and-transformers.html">Now [2023] with their own
irritated notebook.</a>)
<br>If someone <em>were</em> to throw large-language-model-sized computing resources at a Good Old Fashioned SCFG learner, what kind of performance would one get on the usual benchmarks? Heck, what if one used a truly capacious implementation of <a href="cep-gzip.html">Lempel-Ziv</a>? (You'd have to back out the probabilities from the LZ code-lengths, but <a href="mdl.html">we know how to do that.</a>) [See same notebook.]
<br>On that note: could one build a GPT-esque program using Lempel-Ziv as the underlying model? Conversely, can we understand transformers as basically doing some sort of source coding? (The latter question is almost certainly addressed in the literature.) [Ditto.]
<li> What's going on with diffusion models for images? (I know, that's really vague.)
<br>While I am proposing brutally stupid experiments: Take a big labeled image data set and do latent semantic analysis on the labels, i.e., PCA on those
bags-of-words, <em>and</em> do PCA on the images themselves. Learn a linear mapping from the word embedding space to the image embedding space. Now take a text query/prompt, map it into the word embedding space (i.e., project on to the word PCs), map that to the image space, and generate an image (i.e., take the appropriate linear combination of image PCs). The result will probably be a bit fuzzy but there should be ways to make it prettier... Of course, after that we
kernelize the linear steps (in all possible combinations).
<li>I do not understand how "self-supervised" learning is supposed to differ from what we always did in <em>un</em>-supervised learning with (e.g.) <a href="mixture-models.html">mixture models</a>, or, for that matter, from how statisticians
have "trained" autoregressions since about 1900.
</ol>
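<P>On backing out probabilities from LZ code lengths (item 1 above): the standard MDL trick is to read a code length of L bits as a probability of 2^-L. A minimal sketch, using zlib's deflate as a crude stand-in for an idealized Lempel-Ziv coder; the estimates are only as good as the compressor, and a serious version would need a much better one:

```python
import zlib

def code_length_bits(s: bytes) -> int:
    # Bits in the zlib-compressed string; a crude proxy for an
    # idealized Lempel-Ziv code length.
    return 8 * len(zlib.compress(s, 9))

def next_symbol_probs(context: bytes, alphabet: bytes) -> dict:
    # MDL-style back-out: P(a | context) is proportional to
    # 2^{-(L(context + a) - L(context))}, renormalized over the alphabet.
    base = code_length_bits(context)
    weights = {a: 2.0 ** -(code_length_bits(context + bytes([a])) - base)
               for a in alphabet}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

# Keys are byte values (ints); the context is a highly repetitive string.
probs = next_symbol_probs(b"abababababababababab" * 5, b"ab")
```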
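<P>The "brutally stupid" experiment in item 2, in skeleton form. Everything below is synthetic stand-in data (random matrices in place of real images and bag-of-words label vectors, arbitrary dimensions); the point is just the pipeline: PCA both spaces, fit a linear map between the scores, and decode a prompt through it:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: 50 "images" of 64 pixels, each paired with a
# 30-dimensional bag-of-words label vector.
n, d_img, d_txt, k = 50, 64, 30, 5
images = rng.normal(size=(n, d_img))
labels = rng.normal(size=(n, d_txt))

def pca(X, k):
    # Mean, top-k principal components, and the scores of X on them.
    mu = X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:k], (X - mu) @ Vt[:k].T

img_mu, img_pcs, img_scores = pca(images, k)
txt_mu, txt_pcs, txt_scores = pca(labels, k)

# Linear map from text-embedding space to image-embedding space,
# fit by least squares on the paired training data.
A, *_ = np.linalg.lstsq(txt_scores, img_scores, rcond=None)

def generate(prompt_bow):
    # Project the prompt onto the text PCs, map to image-PC space, and
    # reconstruct an image as a linear combination of image PCs.
    z_txt = (prompt_bow - txt_mu) @ txt_pcs.T
    z_img = z_txt @ A
    return img_mu + z_img @ img_pcs

img = generate(labels[0])
```

The kernelized variants would replace each linear projection above with its kernel counterpart.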
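<P>And on item 3: "training" an autoregression is already next-step prediction of a series on itself, which is to say, what is now called self-supervision. A minimal AR(1) fit by ordinary least squares, on simulated data:

```python
import random

random.seed(0)
x = [0.0]
for _ in range(500):                 # simulate x_{t+1} = 0.8 x_t + noise
    x.append(0.8 * x[-1] + random.gauss(0.0, 1.0))

# "Self-supervision", 1900s-style: the targets are the series shifted by one.
inputs, targets = x[:-1], x[1:]
n = len(inputs)
mx, my = sum(inputs) / n, sum(targets) / n
a = (sum((u - mx) * (v - my) for u, v in zip(inputs, targets))
     / sum((u - mx) ** 2 for u in inputs))
b = my - a * mx                      # OLS slope and intercept
# The slope a should come out near the true coefficient, 0.8.
```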
<P><strong>Additional stray thought, recorded 27 May 2023</strong>: The loss
landscape for a neural network, in terms of its weights, is usually very
non-convex, so it's surprising that gradient descent (with gradients computed
by backpropagation) works so well. This leads me, unoriginally, to suspect that there is a lot of
hidden structure in the <a href="optimization.html">optimization</a> problem.
Some of this is presumably
just <a href="symmetries-of-neural-networks.html">symmetries</a>. But I do
wonder if there isn't a way to reformulate it all as a convex program. (Though
why gradient descent in the weights would then find it is a bit of a different
question...) Alternately, maybe none of this is true and optimization is just
radically easier than we thought; in that case
I'd <a href="http://bactra.org/weblog/918.html">eat some crow</a>, and be
willing to embrace a lot more <a href="planned-economies.html">central planning
in the future socialist commonwealth</a>.
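<P>The permutation symmetry is easy to exhibit directly: relabeling a network's hidden units permutes its weights without changing the function it computes, so the loss landscape contains exact ties between distinct weight settings. (At a minimum, such ties are one reason the landscape cannot be convex, since a strictly convex loss would have a unique minimizer.) A minimal check, with arbitrary made-up weights and data:

```python
import math

def mlp(x, w1, b1, w2):
    # One-hidden-layer tanh network, scalar input and output.
    h = [math.tanh(w * x + b) for w, b in zip(w1, b1)]
    return sum(v * hi for v, hi in zip(w2, h))

def sq_loss(params, data):
    w1, b1, w2 = params
    return sum((mlp(x, w1, b1, w2) - y) ** 2 for x, y in data)

data = [(0.0, 0.1), (1.0, 0.9), (-1.0, -0.7)]

theta = ([1.0, -2.0], [0.3, 0.5], [0.7, -0.4])          # arbitrary weights
theta_swapped = ([-2.0, 1.0], [0.5, 0.3], [-0.4, 0.7])  # hidden units relabeled
theta_mid = ([-0.5, -0.5], [0.4, 0.4], [0.15, 0.15])    # average of the two

# Relabeling hidden units leaves the function, hence the loss, exactly
# unchanged, while the average of the two tied settings collapses the
# hidden units into one and computes a genuinely different function.
```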
<P>I presume there are scads of papers on all of these issues, so
pointers are genuinely appreciated.
<ul>See also:
<li><a href="adversarial-examples.html">Adversarial Examples</a>
<li><a href="ai.html">Artificial Intelligence</a>
<li><a href="cognitive-science.html">Cognitive Science</a>
<li><a href="interpolation.html">Interpolation in Statistical Learning</a>
<li><a href="neuroscience.html">Neuroscience</a>
<li><a href="data-mining.html">Data Mining</a>
<li><a href="symmetries-of-neural-networks.html">Symmetries of Neural Networks</a>
<li><a href="uncertainty-for-neural-networks.html">Uncertainty for Neural Networks, and Other Large Complicated Models</a>
</ul>
<ul>Recommended (big picture):
<li>Maureen Caudill and Charles Butler, <cite><a href="https://doi.org/10.7551/mitpress/4873.001.0001">Naturally Intelligent Systems</a></cite>
<li>Patricia Churchland and Terrence Sejnowski, <cite><a href="https://doi.org/10.7551/mitpress/2010.001.0001">The Computational
Brain</a></cite>
<li>Chris Eliasmith and Charles Anderson, <cite><a href="http://bactra.org/weblog/algae-2006-06.html#neural-engineering">Neural Engineering:
Computation, Representation, and Dynamics in Neurobiological Systems</a></cite>
<li>Gary F. Marcus, <cite>The Algebraic Mind: Integrating Connectionism
and Cognitive Science</cite> [On the limits of the connectionist approach to
cognition, with special reference to <a href="linguistics.html">language and
grammar</a>. Cf. later papers by Marcus below.]
<li>Brian Ripley, <cite>Pattern Recognition and Neural Networks</cite>
</ul>
<ul>Recommended (close-ups; very misc. and not nearly extensive enough):
<li>Larry Abbott and Terrence Sejnowski (eds.), <cite>Neural Codes and
Distributed Representations</cite>
<li>Martin Anthony and Peter L. Bartlett, <cite><a href="http://bactra.org/reviews/anthony-bartlett.html">Neural Network Learning: Theoretical
Foundations</a></cite>
<li>Michael A. Arbib (ed.), <cite>The Handbook of Brain Theory and
Neural Networks</cite>
<li>Dana Ballard, <cite>An Introduction to Natural Computation</cite>
[<a href="../reviews/ballard-natural/">Review: Not Natural Enough</a>]
<li>M. J. Barber, J. W. Clark and C. H. Anderson, "Neural Representation of Probabilistic Information", <cite>Neural Computation</cite>
<strong>15</strong> (2003): 1843--1864, <a href="http://arxiv.org/abs/cond-mat/0108425">arxiv:cond-mat/0108425</a>
<li>Suzanna Becker, "Unsupervised Learning Procedures for Neural Networks", <cite>International Journal of Neural Systems</cite> <strong>2</strong>
(1991): 17--33
<li>Mikhail Belkin, "Fit without fear: Remarkable mathematical phenomena of deep learning through the prism of interpolation", <a href="http://arxiv.org/abs/2105.14368">arxiv:2105.14368</a>
<li>Tolga Ergen, Mert Pilanci, "Global Optimality Beyond Two Layers: Training Deep ReLU Networks via Convex Programs", <a href="http://arxiv.org/abs/2110.05518">arxiv:2110.05518</a>
<li>Adam Gaier, David Ha, "Weight Agnostic Neural Networks", <a href="http://arxiv.org/abs/1906.04358">arxiv:1906.04358</a>
<li>Surya Ganguli, Dongsung Huh and Haim Sompolinsky, "Memory
traces in dynamical systems", <a href="http://dx.doi.org/10.1073/pnas.0804451105"><cite>Proceedings of the National Academy of Sciences</cite> (USA) <strong>105</strong> (2008): 18970--18975</a>
<li>Geoffrey Hinton and Terrence Sejnowski (eds.), <cite>Unsupervised Learning</cite> [A sort of "<cite>Neural Computation</cite>'s Greatest Hits" compilation]
<li>Anders Krogh and Jesper Vedelsby, "Neural Network Ensembles, Cross Validation, and Active Learning", <a href="http://books.nips.cc/papers/files/nips07/0231.pdf"><cite>NIPS 7</cite> (1994): 231--238</a>
<li>Aaron Mishkin and Mert Pilanci, "Optimal Sets and Solution Paths of ReLU Networks" [<a href="https://web.stanford.edu/~pilanci/papers/optimal_sets_of_relu_networks_icml2023.pdf">PDF preprint via Prof. Pilanci</a>]
<li>Andrew M. Saxe, Yamini Bansal, Joel Dapello, Madhu Advani, Artemy Kolchinsky, Brendan D. Tracey and David D. Cox, "On the information bottleneck theory of deep learning", <a href="https://doi.org/10.1088/1742-5468/ab3985"><cite>Journal of Statistical Mechanics: Theory and Experiment</cite> (2019) 124020</a> [This looks like trouble for an idea I found very promising]
<li>Yifei Wang, Jonathan Lacotte, Mert Pilanci, "The Hidden Convex Optimization Landscape of Two-Layer ReLU Neural Networks: an Exact Characterization of the Optimal Solutions", <a href="http://arxiv.org/abs/2006.05900">arxiv:2006.05900</a>
<li>Mathukumalli Vidyasagar, <cite>A Theory of Learning and
Generalization: With Applications to Neural Networks and Control Systems</cite>
[Extensive discussion of the application
of <a href="learning-theory.html">statistical learning theory</a> to neural
networks, along with the purely computational difficulties. <a href="../weblog/algae-209-01.html#vidyasagar">Mini-review</a>]
<li>T. L. H. Watkin, A. Rau and M. Biehl, "The Statistical Mechanics of
Learning a Rule," <a
href="http://link.aps.org/abstract/RMP/v65/p499"><cite>Reviews of Modern
Physics</cite> <strong>65</strong> (1993): 499--556</a>
<li>Achilleas Zapranis and Apostolos-Paul Refenes, <cite>Principles of
Neural Model Identification, Selection and Adequacy, with Applications to
Financial Econometrics</cite> [Their English is less than perfect, but they've
got very sound ideas about all the important topics]
</ul>
<ul>Recommended, <a href="https://pinboard.in/u:cshalizi/t:your_favorite_deep_neural_network_sucks/">"your favorite deep neural network sucks"</a>:
<li>Wieland Brendel, Matthias Bethge, "Approximating CNNs with Bag-of-local-Features models works surprisingly well on ImageNet", <a href="https://openreview.net/forum?id=SkfMWhAqYQ"><cite>International Conference on Learning Representations</cite> 2019</a>
<li>Brandon Carter, Siddhartha Jain, Jonas Mueller, David Gifford, "Overinterpretation reveals image classification model pathologies", <a href="http://arxiv.org/abs/2003.08907">arxiv:2003.08907</a>
<li>Maurizio Ferrari Dacrema, Paolo Cremonesi, Dietmar Jannach, "Are We Really Making Much Progress? A Worrying Analysis of Recent Neural Recommendation Approaches", <a href="http://arxiv.org/abs/1907.06902">arxiv:1907.06902</a>
<li>Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A. Wichmann, Wieland Brendel, "ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness", <a href="https://openreview.net/forum?id=Bygh9j09KX"><cite>International Conference on Learning Representations</cite> 2019</a>
<li>Micah Goldblum, Jonas Geiping, Avi Schwarzschild, Michael Moeller, Tom Goldstein, "Truth or Backpropaganda? An Empirical Investigation of Deep Learning Theory", <a href="http://arxiv.org/abs/1910.00359">arxiv:1910.00359</a>
<li>Qian Huang, Horace He, Abhay Singh, Ser-Nam Lim, Austin R. Benson, "Combining Label Propagation and Simple Models Out-performs Graph Neural Networks", <a href="http://arxiv.org/abs/2010.13993">arxiv:2010.13993</a>
<li>Andee Kaplan, Daniel Nordman, Stephen Vardeman, "On the instability and degeneracy of deep learning models", <a href="http://arxiv.org/abs/1612.01159">arxiv:1612.01159</a>
<li>Gary Marcus
<ul>
<li>"Deep Learning: A Critical Appraisal", <a href="http://arxiv.org/abs/1801.00631">arxiv:1801.00631</a>
<li>"Innateness, AlphaZero, and Artificial Intelligence", <a href="http://arxiv.org/abs/1801.05667">arxiv:1801.05667</a>
</ul>
<li>Anh Nguyen, Jason Yosinski, Jeff Clune, "Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images", <a href="http://arxiv.org/abs/1412.1897">arxiv:1412.1897</a>
<li>Filip Piekniewski, <a href="https://blog.piekniewski.info/2018/07/14/autopsy-dl-paper/">"Autopsy of a Deep Learning Paper"</a>
<li>Adityanarayanan Radhakrishnan, Karren Yang, Mikhail Belkin, Caroline Uhler, "Memorization in Overparameterized Autoencoders", <a href="http://arxiv.org/abs/1810.10333">arxiv:1810.10333</a>
<li>Ali Rahimi and Benjamin Recht
<ul>
<li>"Reflections on Random Kitchen Sinks" <a href="http://www.argmin.net/2017/12/05/kitchen-sinks/">argmin blog, 5 December 2017</a>
<li>"An Addendum to Alchemy", <a href="http://www.argmin.net/2017/12/11/alchemy-addendum/">argmin blog, 11 December 2017</a>
</ul>
<li>Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, Rob Fergus, "Intriguing properties of neural networks", <a href="http://arxiv.org/abs/1312.6199">arxiv:1312.6199</a>
<li>Tan Zhi-Xuan, Nishad Gothoskar, Falk Pollok, Dan Gutfreund, Joshua B. Tenenbaum, Vikash K. Mansinghka, "Solving the Baby Intuitions Benchmark with a Hierarchically Bayesian Theory of Mind", <a href="http://arxiv.org/abs/2208.02914">arxiv:2208.02914</a>
<li>Halbert White, "Learning in Artificial Neural Networks: A Statistical Perspective", <a href="https://doi.org/10.1162/neco.1989.1.4.425"><cite>Neural Computation</cite> <strong>1</strong> (1989): 425--464</a>
<li>Chengxi Ye, Matthew Evanusa, Hua He, Anton Mitrokhin, Tom Goldstein, James A. Yorke, Cornelia Fermüller, Yiannis Aloimonos, "Network Deconvolution", <a href="http://arxiv.org/abs/1905.11926">arxiv:1905.11926</a> [This is just doing principal components analysis, as invented in 1900]
<li>John R. Zech, Marcus A. Badgeley, Manway Liu, Anthony B. Costa, Joseph J. Titano, Eric K. Oermann, "Confounding variables can degrade generalization performance of radiological deep learning models", <a href="https://doi.org/10.1371/journal.pmed.1002683"><cite>PLoS Medicine</cite> <strong>15</strong> (2018): e1002683</a>, <a href="http://arxiv.org/abs/1807.00431">arxiv:1807.00431</a> [<a href="https://jrzech.medium.com/what-are-radiological-deep-learning-models-actually-learning-f97a546c5b98">Dr. Zech's self-exposition</a>]
<li>Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, Oriol Vinyals, "Understanding deep learning (still) requires rethinking generalization",
<a href="https://doi.org/10.1145/3446776"><cite>Communications of the ACM</cite> <strong>64</strong> (2021): 107--115</a> [previous version: <a href="http://arxiv.org/abs/1611.03530">arxiv:1611.03530</a>]
</ul>
<ul>Recommended, historical:
<li>Michael A. Arbib, <cite>Brains, Machines and Mathematics</cite>
[1964; a model of clarity in exposition and thought]
<li>Donald O. Hebb, <cite>The Organization of Behavior: A
Neuropsychological Theory</cite>
<li>Warren S. McCulloch, <cite>Embodiments of Mind</cite>
</ul>
<ul>Modesty forbids me to recommend:
<li>CRS, <a href="http://bactra.org/weblog/2014-11-13-intriguing-properties.html">"Notes on 'Intriguing Properties of Neural Networks', and two other papers (2014)"</a> [On Szegedy et al., Nguyen et al., and Chalupka et al.]
<li>CRS, lecture notes on neural networks for CMU's <a href="http://www.stat.cmu.edu/~cshalizi/dm/">36-462, "methods of statistical learning"</a> (formerly 36-462, "data mining", and before that, 36-350, "data mining"). Currently (2022), this is <a href="http://www.stat.cmu.edu/~cshalizi/dm/22/lectures/21/lecture-21.pdf">lecture 21</a>, but that might change the next time I teach it.
</ul>
<ul>To read, history and philosophy:
<li>William Bechtel and Adele Abrahamsen, <cite>Connectionism
and the Mind: Parallel Processing, Dynamics, and Evolution
in Networks</cite>
<li>William Bechtel and Robert C. Richardson, <cite><A href="http://pup.princeton.edu/titles/4971.html">Discovering
Complexity: Decomposition and Localization as Strategies in Scientific
Research</a></cite>
<li>Peter Gärdenfors, <cite>Conceptual Spaces: The Geometry of Thought</cite>
<li>Orit Halpern, "The Future Will Not Be Calculated: Neural Nets, Neoliberalism, and Reactionary Politics", <a href="https://doi.org/10.1086/717313"><cite>Critical Inquiry</cite> <strong>48</strong> (2022): 334--359</a>
<li>Andrea Loettgers, "Getting Abstract Mathematical Models in Touch
with
Nature", <a href="http://dx.doi.org/10.1017/S0269889706001153"><cite>Science in
Context</cite>
<strong>20</strong> (2007): 97--124</a> [Intellectual history of the Hopfield
model and its reception]
</ul>
<ul>To read, now-historical interest:
<li>Gail A. Carpenter and Stephen Grossberg (eds.), <cite>Pattern
Recognition by Self-Organizing Neural Networks</cite>
<li><a href="hayek.html">F. A. von Hayek</a>, <cite>The Sensory Order</cite>
<li>Jim W. Kay and D. M. Titterington (eds.), <cite>Statistics and
Neural Networks: Advances at the Interface</cite>
<li>McClelland and Rumelhart (eds.), <cite>Parallel Distributed
Processing</cite>
<li>Marvin Minsky and Seymour Papert, <cite>Perceptrons</cite>
<li>Kohonen, <cite><a href="self-organization.html">Self-organization</a> and associative
memory</cite>
</ul>
<ul>To read, not otherwise classified:
<li>Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron Courville, Marc G. Bellemare, "Deep Reinforcement Learning at the Edge of the Statistical Precipice", <a href="http://arxiv.org/abs/2108.13264">arxiv:2108.13264</a>
<li>Daniel Amit, <cite>Modelling Brain Function</cite>
<li>Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W. Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, Nando de Freitas, "Learning to learn by gradient descent by gradient descent", <a href="http://arxiv.org/abs/1606.04474">arxiv:1606.04474</a>
<li>Marco Antonio Armenta, Pierre-Marc Jodoin, "The Representation Theory of Neural Networks", <a href="http://arxiv.org/abs/2007.12213">arxiv:2007.12213</a>
<li>Pierre Baldi, <cite><a href="https://doi.org/10.1017/9781108955652">Deep Learning in Science</a></cite> [2021; Baldi has been around for more than a moment, and so I am interested to see what he makes of recent developments...]
<li>V. M. Becerra, F. R. Garces, S. J. Nasuto and W. Holderbaum, "An
Efficient Parameterization of Dynamic Neural Networks for Nonlinear System
Identification", <a href="http://dx.doi.org/10.1109/TNN.2005.849844"><cite>IEEE
Transactions on Neural Networks</cite> <strong>16</strong> (2005): 983--988</a>
<li><a href="http://yuggoth.ces.cwru.edu/beer/beer.html">Randall
Beer</a>, <cite>Intelligence as Adaptive Behavior: An Experiment in
Computational Neuroethology</cite>
<li>Hugues Berry and Mathias Quoy, "Structure and Dynamics of Random
Recurrent Neural Networks", <a
href="http://dx.doi.org/10.1177/105971230601400204"><cite>Adaptive
Behavior</cite> <strong>14</strong> (2006): 129--137</a>
<li>Dimitri P. Bertsekas and John N. Tsitsiklis, <cite>Neuro-Dynamic
Programming</cite>
<li>Michael Biehl, Reimer Kühn, Ion-Olimpiu Stamatescu, "Learning
structured data from unspecific reinforcement," <a
href="http://arxiv.org/abs/cond-mat/0001405">cond-mat/0001405</a>
<li>D. Bollé and P. Kozlowski, "On-line learning and
generalisation in coupled perceptrons," <a
href="http://arxiv.org/abs/cond-mat/0111493">cond-mat/0111493</a>
<li>Christoph Bunzmann, Michael Biehl, and Robert Urbanczik, "Efficient
training of multilayer perceptrons using principal component analysis", <a
href="http://dx.doi.org/10.1103/PhysRevE.72.026117"><cite>Physical Review
E</cite> <strong>72</strong> (2005): 026117</a>
<li>Nicholas Carlini, Jamie Hayes, Milad Nasr, Matthew Jagielski, Vikash Sehwag, Florian Tramèr, Borja Balle, Daphne Ippolito, Eric Wallace, "Extracting Training Data from Diffusion Models", <a href="http://arxiv.org/abs/2301.13188">arxiv:2301.13188</a>
<li>Axel Cleeremans, <cite><a href="http://mitpress.mit.edu/9780262032056">Mechanisms of Implicit Learning: Connectionist Models of Sequence Processing</a></cite>
<li>Salvatore Cuomo, Vincenzo Schiano di Cola, Fabio Giampaolo, Gianluigi Rozza, Maziar Raissi, Francesco Piccialli, "Scientific Machine Learning through Physics-Informed Neural Networks: Where we are and What's next", <a href="http://arxiv.org/abs/2201.05624">arxiv:2201.05624</a>
<li>M. C. P. deSouto, T. B. Ludermir and W. R. deOliveira, "Equivalence
Between RAM-Based Neural Networks and Probabilistic Automata", <a
href="http://dx.doi.org/10.1109/TNN.2005.849838"><cite>IEEE Transactions on
Neural Networks</cite> <strong>16</strong> (2005): 996--999</a>
<li>Aniket Didolkar, Kshitij Gupta, Anirudh Goyal, Nitesh B. Gundavarapu, Alex Lamb, Nan Rosemary Ke, Yoshua Bengio, "Temporal Latent Bottleneck: Synthesis of Fast and Slow Processing Mechanisms in Sequence Learning", <a href="http://arxiv.org/abs/2205.14794">arxiv:2205.14794</a>
<li>Keith L. Downing, <cite><a href="http://mitpress.mit.edu/9780262029131">Intelligence Emerging: Adaptivity and Search in Evolving Neural Systems</a></cite>
<li>Brandon Duderstadt, Hayden S. Helm, Carey E. Priebe, "Comparing Foundation Models using Data Kernels", <a href="http://arxiv.org/abs/2305.05126">arxiv:2305.05126</a>
<li>Liat Ein-Dor and Ido Kanter, "Confidence in prediction by neural
networks," <a href="https://doi.org/10.1103/PhysRevE.60.799"><cite>Physical Review E</cite> <strong>60</strong> (1999): 799--802</a>
<li>Chris Eliasmith, "A Unified Approach to Building and Controlling
Spiking Attractor Networks", <a
href="http://neco.mitpress.org/cgi/content/abstract/17/6/1276"><cite>Neural
Computation</cite> <strong>17</strong> (2005): 1276--1314</a>
<li>Elman et al., <cite>Rethinking Innateness</cite>
<li>Frank Emmert-Streib
<ul>
<li>"Self-organized annealing in laterally
inhibited neural networks shows power law decay", <a
href="http://arxiv.org/abs/cond-mat/0401633">cond-mat/0401633</a>
<li>"A Heterosynaptic Learning Rule for Neural Networks",
<a href="http://arxiv.org/abs/cond-mat/0608564">cond-mat/0608564</a>
</ul>
<li>Magnus Enquist and Stefano Ghirlanda, <cite><a href="http://pup.princeton.edu/titles/8107.html">Neural Networks and Animal Behavior</a></cite>
<li>Gary William Flake, "The Calculus of Jacobian Adaptation" [Not
confined to neural nets]
<li>Leonardo Franco, "A measure for the complexity of Boolean
functions related to their implementation in neural networks," <a
href="http://arxiv.org/abs/cond-mat/0111169">cond-mat/0111169</a>
<li>Jürgen Franke and Michael H. Neumann, "Bootstrapping Neural
Networks," <cite>Neural Computation</cite> <strong>12</strong> (2000):
1929--1949
<li>Ian Goodfellow, Yoshua Bengio and Aaron Courville,
<cite><a href="http://www.deeplearningbook.org">Deep Learning</a></cite>
<li>Michiel Hermans and Benjamin Schrauwen, "Recurrent Kernel Machines: Computing with Infinite Echo State Networks", <a href="http://dx.doi.org/10.1162/NECO_a_00200"><cite>Neural Computation</cite> <strong>24</strong> (2012): 104--133</a>
<li>Sara Hooker, Aaron Courville, Gregory Clark, Yann Dauphin, Andrea Frome, "What Do Compressed Deep Neural Networks Forget?", <a href="http://arxiv.org/abs/1911.05248">arxiv:1911.05248</a>
<li>Jun-ichi Inoue and A. C. C. Coolen, "Dynamics of on-line Hebbian
learning with structurally unrealizable restricted training sets," <a
href="http://arxiv.org/abs/cond-mat/0105004">cond-mat/0105004</a>
<li>Henrik Jacobsson, "Rule Extraction from Recurrent Neural Networks:
A Taxonomy and
Review", <a
href="http://neco.mitpress.org/cgi/content/abstract/17/6/1223"><cite>Neural
Computation</cite> <strong>17</strong> (2005): 1223--1263</a>
<li>Artem Kaznatcheev, Konrad Paul Kording, "Nothing makes sense in deep learning, except in the light of evolution", <a href="http://arxiv.org/abs/2205.10320">arxiv:2205.10320</a>
<li>Alon Keinan, Ben Sandbank, Claus C. Hilgetag, Isaac Meilijson and
Eytan Ruppin, "Fair Attribution of Functional Contribution in Artificial and
Biological Networks", <a
href="http://neco.mitpress.org/cgi/content/abstract/16/9/1887"><cite>Neural
Computation</cite> <strong>16</strong> (2004): 1887--1915</a>
<li>Beom Jun Kim, "Performance of networks of artificial neurons: The
role of clustering",
<a href="http://arxiv.org/abs/q-bio.NC/0402045">q-bio.NC/0402045</a>
<li>Konstantin Klemm, Stefan Bornholdt and Heinz Georg Schuster,
"Beyond Hebb: XOR and biological learning," <a
href="http://arxiv.org/abs/adap-org/9909005">adap-org/9909005</a>
<li>Michael Kohler, Adam Krzyzak, "Over-parametrized deep neural networks do not generalize well", <a href="http://arxiv.org/abs/1912.03925">arxiv:1912.03925</a>
<li>G. A. Kohring, "Artificial Neurons with Arbitrarily Complex
Internal Structures," <a
href="http://arXiv.org/abs/cs/0108009">cs.NE/0108009</a>
<li>John F. Kolen (ed.), <cite>A Field Guide to Dynamical Recurrent
Networks</cite>
<li>Brenden M. Lake, Tomer D. Ullman, Joshua B. Tenenbaum, Samuel J. Gershman, "Building Machines That Learn and Think Like People", <a href="http://arxiv.org/abs/1604.00289">arxiv:1604.00289</a>
<li>Jaeho Lee, Maxim Raginsky, "Learning finite-dimensional coding schemes with nonlinear reconstruction maps", <a href="http://arxiv.org/abs/1812.09658">arxiv:1812.09658</a>
<li>Hannes Leitgeb, "Interpreted Dynamical Systems and Qualitative
Laws: From Neural Networks to Evolutionary Systems", <a
href="http://dx.doi.org/10.1007/s11229-005-9086-5"><cite>Synthese</cite> <strong>146</strong>
(2005): 189--202</a> ["Interpreted dynamical systems are dynamical systems with
an additional interpretation mapping by which propositional formulas are
assigned to system states. The dynamics of such systems may be described in
terms of qualitative laws for which a satisfaction clause is defined. We show
that the systems C and CL of nonmonotonic logic are adequate with respect to
the corresponding description of the classes of interpreted ordered and
interpreted hierarchical systems, respectively"]
<li>Yonatan Loewenstein and H. Sebastian Seung, "Operant matching is a
generic outcome of synaptic plasticity based on the covariance between reward
and neural activity", <a
href="http://dx.doi.org/10.1073/pnas.0505220103"><cite>Proceedings of the
National Academy of Sciences</cite> (USA) <strong>103</strong> (2006):
15224--15229</a> [The abstract promises a result about all possible neural
mechanisms having some fairly generic features; this is clearly the right way
to do theoretical neuroscience, but rarely done...]
<li>Wolfgang Maass (ed.), <cite>Pulsed Neural Networks</cite>
<li>Wolfgang Maass and Eduardo D. Sontag, "Neural Systems as Nonlinear
Filters," <cite>Neural Computation</cite> <strong>12</strong> (2000):
1743--1772
<li>M. S. Mainieri and R. Erichsen Jr, "Retrieval and Chaos in
Extremely Diluted Non-Monotonic Neural Networks," <a
href="http://arxiv.org/abs/cond-mat/0202097">cond-mat/0202097</a>
<li>Daniele Marinazzo, Mario Pellicoro, Sebastiano Stramaglia, "Causal
interactions and delays in a neuronal ensemble", <a
href="http://arxiv.org/abs/cond-mat/0609523">cond-mat/0609523</a>
<li>Luke Metz, C. Daniel Freeman, Niru Maheswaranathan, Jascha Sohl-Dickstein, "Training Learned Optimizers with Randomly Initialized Learned Optimizers", <a href="http://arxiv.org/abs/2101.07367">arxiv:2101.07367</a>
<li>Mika Meitz, "Statistical inference for generative adversarial networks", <a href="http://arxiv.org/abs/2104.10601">arxiv:2104.10601</a>
<li>Seiji Miyoshi, Kazuyuki Hara, and Masato Okada, "Analysis of
ensemble learning using simple perceptrons based on online learning theory",
<a href="http://dx.doi.org/10.1103/PhysRevE.71.036116"><cite>Physical Review
E</cite> <strong>71</strong> (2005): 036116</a>
<li>Javier R. Movellan, Paul Mineiro, and R. J. Williams, "A Monte
Carlo EM Approach for Partially Observable Diffusion Processes: Theory and
Applications to Neural Networks," <cite>Neural Computation</cite> <strong>14</strong> (2002): 1507--1544
<li>Randall C. O'Reilly, "Generalization in Interactive Networks: The Benefits of Inhibitory Competition and Hebbian Learning," <cite>Neural Computation</cite> <strong>13</strong> (2001): 1199--1241
<li>Steven Phillips, "Systematic Minds, Unsystematic Models:
Learning Transfer in Humans and Networks", <cite>Minds and Machines</cite>
<strong>9</strong> (1999): 383--398
<li>Guillermo Puebla, Jeffrey S. Bowers, "Can Deep Convolutional Neural Networks Learn Same-Different Relations?", <a href="https://doi.org/10.1101/2021.04.06.438551">bioRxiv 2021.04.06.438551</a>
<li>Suman Ravuri, Mélanie Rey, Shakir Mohamed, Marc Deisenroth, "Understanding Deep Generative Models with Generalized Empirical Likelihoods", <a href="http://arxiv.org/abs/2306.09780">arxiv:2306.09780</a>
<li>Tim Räz, "Understanding Deep Learning with Statistical Relevance", <a href="https://doi.org/10.1017/psa.2021.12"><cite>Philosophy of Science</cite> <strong>89</strong> (2022): 20--41</a> [<a href="https://pinboard.in/u:cshalizi/b:2ae97d49f2dd">Comments</a>]
<li>Daniel A. Roberts and Sho Yaida, <cite><a href="https://doi.org/10.1017/9781009023405">The Principles of Deep Learning Theory: An Effective Theory Approach to Understanding Neural Networks</a></cite>
<li>Patrick D. Roberts, "Dynamics of Temporal Learning Rules,"
<cite>Physical Review E</cite> <strong>62</strong> (2000): 4077--4082
<li>Fabrice Rossi, Brieuc Conan-Guez, "Functional Multi-Layer Perceptron: a Nonlinear Tool for Functional Data Analysis", <a href="http://arxiv.org/abs/0709.3642">arxiv:0709.3642</a>
<li>Fabrice Rossi, Nicolas Delannay, Brieuc Conan-Guez, Michel
Verleysen, "Representation of Functional Data in Neural
Networks", <a href="http://arxiv.org/abs/0709.3641">arxiv:0709.3641</a>
<li>Ines Samengo, "Independent neurons representing a finite set of
stimuli: dependence of the mutual information on the number of units sampled,"
<cite>Network: Computation in Neural Systems</cite> <strong>12</strong>
(2000): 21--31, <a
href="http://arxiv.org/abs/cond-mat/0202023">cond-mat/0202023</a>
<li>Ines Samengo and Alessandro Treves, "Representational capacity of a
set of independent neurons," <a
href="http://arxiv.org/abs/cond-mat/0201588">cond-mat/0201588</a>
<li>Vitaly Schetinin and Anatoly Brazhnikov, "Diagnostic Rule
Extraction Using Neural Networks", <a
href="http://arxiv.org/abs/cs.NE/0504057">cs.NE/0504057</a>
<li>Philip Seliger, Stephen C. Young, and Lev S. Tsimring, "Plasticity
and learning in a network of coupled phase oscillators," <a
href="http://arxiv.org/abs/nlin.AO/0110044">nlin.AO/0110044</a>
<li>Paul Smolensky and Géraldine Legendre, <cite><a href="http://mitpress.mit.edu/0-262-19528-3">The Harmonic
Mind: From Neural Computation to Optimality-Theoretic Grammar</a></cite>
<li>Dietrich Stauffer and Amnon Aharony, "Efficient Hopfield pattern
recognition on a scale-free neural network," <a
href="http://arxiv.org/abs/cond-mat/0212601">cond-mat/0212601</a>
<li>Yan Sun, Qifan Song and Faming Liang, "Consistent Sparse Deep Learning: Theory and Computation", <a href="https://doi.org/10.1080/01621459.2021.1895175"><cite>Journal of the American Statistical Association</cite> <strong>117</strong> (2022): 1981--1995</a>
<li>Marc Toussaint
<ul>
<li>"On model selection and the disability of neural networks
to decompose tasks," <a
href="http://arxiv.org/abs/nlin.AO/0202038">nlin.AO/0202038</a>
<li>"A neural model for multi-expert architectures," <a
href="http://arxiv.org/abs/nlin.AO/0202039">nlin.AO/0202039</a>
</ul>
<li>T. Uezu and A. C. C. Coolen, "Hierarchical Self-Programming in
Recurrent Neural Networks," <a
href="http://arxiv.org/abs/cond-mat/0109099">cond-mat/0109099</a>
<li>Leslie G. Valiant
<ul>
<li><cite>Circuits of the Mind</cite>
<li>"Memorization and Association on a Realistic Neural Model",
<a href="http://neco.mitpress.org/cgi/content/abstract/17/3/527"><cite>Neural
Computation</cite> <strong>17</strong> (2005): 527--555</a>
</ul>
<li>Frank van der Velde and Marc de Kamps, "Neural blackboard
architectures of combinatorial structures in
cognition", <a
href="http://dx.doi.org/10.1017/S0140525X06009022"><cite>Behavioral and Brain
Sciences</cite> <strong>29</strong> (2006): 37--70</a> [+ peer commentary]
<li>Hiroshi Wakuya and Jacek M. Zurada, "Bi-directional computing
architecture for time series prediction," <cite>Neural Networks</cite>
<strong>14</strong> (2001): 1307--1321
<li>C. Xiang, S. Ding and T. H. Lee, "Geometrical Interpretation and
Architecture Selection of MLP", <cite>IEEE Transactions on Neural
Networks</cite> <strong>16</strong> (2005): 84--96 [MLP = multi-layer
perceptron]
</ul>
<ul>To read, conditional probability density estimation:
<li>Michael Feindt, "A Neural Bayesian Estimator for Conditional
Probability Densities", <a
href="http://arxiv.org/abs/physics/0402093">physics/0402093</a>
<li>Dirk Husmeier, <cite>Neural Networks for Conditional Probability
Estimation</cite>
</ul>
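<P>The common thread in these references: instead of predicting a single value, the network maps an input <em>x</em> to the <em>parameters</em> of a conditional density for <em>y</em>, and training maximizes the conditional likelihood. A minimal forward-pass sketch, with illustrative sizes and an untrained (random) network, not taken from any of the cited works:

```python
import numpy as np

rng = np.random.default_rng(0)

# One-hidden-layer net mapping a scalar input x to the parameters of a
# K-component Gaussian mixture over y: mixing weights, means, log-st.devs.
# (All sizes and initializations here are arbitrary choices for illustration.)
K, H = 3, 8
W1, b1 = rng.normal(size=(H, 1)), np.zeros(H)
W2, b2 = rng.normal(size=(3 * K, H)), np.zeros(3 * K)

def mdn_params(x):
    h = np.tanh(W1 @ np.atleast_1d(x) + b1)
    out = W2 @ h + b2
    logits, mu, log_sigma = out[:K], out[K:2 * K], out[2 * K:]
    pi = np.exp(logits - logits.max())          # softmax: weights sum to 1
    return pi / pi.sum(), mu, np.exp(log_sigma)

def conditional_density(y, x):
    """Estimated p(y | x): a Gaussian mixture whose parameters depend on x."""
    pi, mu, sigma = mdn_params(x)
    return float(np.sum(pi * np.exp(-0.5 * ((y - mu) / sigma) ** 2)
                        / (sigma * np.sqrt(2 * np.pi))))
```

<P>Training would then minimize the average of <code>-log conditional_density(y, x)</code> over the data; the references above develop variants of this basic idea.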
<ul>To read, applications of statistical physics to NNs (with thanks to Osame Kinouchi for recommendations):
<li>Nestor Caticha and Osame Kinouchi, "Time ordering in the evolution
of information processing and modulation systems," <cite>Philosophical
Magazine B</cite> <strong>77</strong> (1998): 1565--1574
<li>A. C. C. Coolen, "Statistical Mechanics of Recurrent Neural
Networks": part I, "Statics," <a
href="http://arxiv.org/abs/cond-mat/0006010">cond-mat/0006010</a> and part II,
"Dynamics," <a
href="http://arxiv.org/abs/cond-mat/0006011">cond-mat/0006011</a>
<li>A. C. C. Coolen, R. Kuehn, and P. Sollich, <cite>Theory of Neural
Information Processing Systems</cite>
<li>A. C. C. Coolen and D. Saad, "Dynamics of Learning with Restricted
Training Sets," <cite>Physical Review E</cite> <strong>62</strong> (2000):
5444--5487
<li>Mauro Copelli, Antonio C. Roque, Rodrigo F. Oliveira and Osame
Kinouchi, "Enhanced dynamic range in a sensory network of excitable
elements," <a
href="http://arxiv.org/abs/cond-mat/0112395">cond-mat/0112395</a>
<li>Valeria Del Prete and Alessandro Treves, "A theoretical model of
neuronal population coding of stimuli with both continuous and discrete
dimensions," <a
href="http://arxiv.org/abs/cond-mat/0103286">cond-mat/0103286</a>
<li>Viktor Dotsenko, <cite>Introduction to the Theory of Spin Glasses
and Neural Networks</cite>
<li>Ethan Dyer, Guy Gur-Ari, "Asymptotics of Wide Networks from Feynman Diagrams", <a href="http://arxiv.org/abs/1909.11304">arxiv:1909.11304</a>
<li>Andreas Engel and Christian P. L. Van den Broeck, <cite>Statistical
Mechanics of Learning</cite>
<li>D. Herschkowitz and M. Opper, "Retarded Learning: Rigorous Results
from Statistical Mechanics," <a
href="http://arxiv.org/abs/cond-mat/0103275">cond-mat/0103275</a>
<li>Osame Kinouchi and Nestor Caticha, "Optimal Generalization in
Perceptrons," <cite>Journal of Physics A</cite> <strong>25</strong> (1992):
6243--6250
<li>W. Kinzel
<ul>
<li>"Statistical Physics of Neural Networks," <cite>Computer
Physics Communications,</cite> <strong>122</strong> (1999): 86--93
<li>"Phase transitions of neural networks,"
<cite>Philosophical Magazine B</cite> <strong>77</strong> (1998): 1455--1477
</ul>
<li>W. Kinzel, R. Metzler and I. Kanter, "Dynamics of Interacting
Neural Networks," <cite>Journal of Physics A</cite> <strong>33</strong> (2000):
L141--L147
<li>Krogh et al., <cite>Introduction to the Theory of Neural
Computation</cite>
<li>Patrick C. McGuire, Henrik Bohr, John W. Clark, Robert Haschke,
Chris Pershing and Johann Rafelski, "Threshold Disorder as a Source of Diverse
and Complex Behavior in Random Nets," <a
href="http://arxiv.org/abs/cond-mat/0202190">cond-mat/0202190</a>
<li>Richard Metzler, Wolfgang Kinzel, Liat Ein-Dor and Ido Kanter,
"Generation of anti-predictable time series by a Neural Network," <a
href="http://arxiv.org/abs/cond-mat/0011302">cond-mat/0011302</a>
<li>R. Metzler, W. Kinzel and I. Kanter, "Interacting Neural
Networks," <cite>Physical Review E</cite> <strong>62</strong> (2000):
2555--2565 [<a href="http://link.aps.org/abstract/PRE/v62/p2555">abstract</a>]
<li>Samy Tindel, "The stochastic calculus method for spin systems",
<a href="http://dx.doi.org/10.1214/009117904000000919"><cite>Annals of
Probability</cite> <strong>33</strong> (2005): 561--581</a>, <a
href="http://arxiv.org/abs/math.PR/0503652">math.PR/0503652</a> [One of the
kinds of spin systems considered being perceptrons]
<li>Robert Urbanczik, "Statistical Physics of Feedforward Neural
Networks," <a href="http://arxiv.org/abs/cond-mat/0201530">cond-mat/0201530</a>
<li>W. A. van Leeuwen and Bastian Wemmenhove, "Learning by a neural net
in a noisy environment --- The pseudo-inverse solution revisited,"
<a href="http://arxiv.org/abs/cond-mat/0205550">cond-mat/0205550</a>
<li>Renato Vicente, Osame Kinouchi and Nestor Caticha, "Statistical
mechanics of online learning of drifting concepts: A variational approach,"
<cite>Machine Learning</cite> <strong>32</strong> (1998): 179--201 [<a
href="http://www.wkap.nl/oasis.htm/168704">abstract</a>]
</ul>
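<P>The spin-glass connection running through this list (Dotsenko's book; Stauffer and Aharony's Hopfield paper; Coolen's lecture notes) starts from the Hopfield model: binary &plusmn;1 "spins" with symmetric Hebbian couplings, whose zero-temperature dynamics descend an Ising-type energy, so that stored patterns become attractors. A minimal sketch, with toy sizes of my own choosing:

```python
import numpy as np

def train_hopfield(patterns):
    """Hebbian ("outer product") storage: J_ij proportional to
    sum over patterns mu of xi_i^mu xi_j^mu, with no self-coupling."""
    n = patterns.shape[1]
    J = patterns.T @ patterns / n
    np.fill_diagonal(J, 0.0)
    return J

def recall(J, state, sweeps=10):
    """Zero-temperature asynchronous dynamics: each spin aligns with its
    local field, exactly as in a deterministic Ising spin system."""
    s = state.copy()
    for _ in range(sweeps):
        for i in range(len(s)):
            h = J[i] @ s
            if h != 0:
                s[i] = 1 if h > 0 else -1
    return s

patterns = np.array([[1, 1, 1, 1, -1, -1, -1, -1],
                     [1, -1, 1, -1, 1, -1, 1, -1]])
J = train_hopfield(patterns)
noisy = patterns[0].copy()
noisy[0] *= -1                      # corrupt the stored pattern: flip one spin
restored = recall(J, noisy)         # recovers patterns[0]
```

<P>With the two (orthogonal) patterns stored here, flipping one spin leaves the state inside the pattern's basin of attraction, and a single asynchronous sweep restores it; the statistical-mechanics literature above is largely about what happens to such basins as the number of stored patterns grows.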
<ul>To read, why are neural networks so easy to fit by back-propagation/gradient descent? [See also <a href="symmetries-of-neural-networks.html">Symmetries of Neural Networks</a>]
<li>Frederik Benzing, "Gradient Descent on Neurons and its Link to Approximate Second-Order Optimization", <a href="http://arxiv.org/abs/2201.12250">arxiv:2201.12250</a>
<li>Sourav Chatterjee, "Convergence of gradient descent for deep neural networks", <a href="http://arxiv.org/abs/2203.16462">arxiv:2203.16462</a>
<li>Michael I. Jordan, Guy Kornowski, Tianyi Lin, Ohad Shamir, Manolis Zampetakis, "Deterministic Nonsmooth Nonconvex Optimization", <a href="http://arxiv.org/abs/2302.08300">arxiv:2302.08300</a>
<li>Levent Sagun, V. Ugur Guney, Gerard Ben Arous, Yann LeCun, "Explorations on high dimensional landscapes", <a href="http://arxiv.org/abs/1412.6615">arxiv:1412.6615</a>
<li>Gal Vardi, "On the Implicit Bias in Deep-Learning Algorithms", <a href="http://arxiv.org/abs/2208.12591">arxiv:2208.12591</a>
</ul>
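<P>One recurring theme in these papers is overparameterization: when a network has many more adjustable weights than there are training points, plain gradient descent on the training loss tends not to get stuck. A toy illustration of the phenomenon itself (not of any one paper's analysis): hand-coded backpropagation and full-batch gradient descent on a one-hidden-layer tanh network with 64 hidden units and only 8 data points.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny regression problem: 8 points, scalar input and output.
X = np.linspace(-1, 1, 8).reshape(-1, 1)
y = np.sin(3 * X)

# Overparameterized one-hidden-layer tanh net; sizes, learning rate, and
# step count are arbitrary choices that keep the run stable.
H, lr = 64, 0.02
W1, b1 = rng.normal(size=(1, H)), np.zeros(H)
W2, b2 = rng.normal(size=(H, 1)) / np.sqrt(H), np.zeros(1)

def forward(X):
    h = np.tanh(X @ W1 + b1)
    return h, h @ W2 + b2

def mse():
    return float(np.mean((forward(X)[1] - y) ** 2))

loss0 = mse()
for step in range(10000):               # full-batch gradient descent
    h, pred = forward(X)
    err = (pred - y) / len(X)           # d(MSE)/d(pred), up to a factor of 2
    gW2, gb2 = h.T @ err, err.sum(0)
    dh = (err @ W2.T) * (1 - h ** 2)    # backpropagate through tanh
    gW1, gb1 = X.T @ dh, dh.sum(0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

loss = mse()                            # far below loss0 after training
```

<P>Nothing here is tuned: generic random initialization and a fixed step size suffice to drive the training error down, which is the empirical puzzle the papers above try to explain.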