Notebooks

Neural Nets, Connectionism, Perceptrons, etc.

17 Jul 2024 11:25

Old notes from c. 2000

I'm mostly interested in them as a means of machine learning or statistical inference. I am particularly interested in their role as models of dynamical systems (via recurrent nets, generally), and as models of transduction.

I need to understand better how the analogy to spin glasses works, but then, I need to understand spin glasses better too.

The arguments that connectionist models are superior, for purposes of cognitive science, to more "symbolic" ones I find unconvincing. (Saying that they're more biologically realistic is like saying that cars are better models of animal locomotion than bicycles, because cars have four appendages in contact with the ground and not two.) This is not to say, of course, that some connectionist models of cognition aren't interesting, insightful and valid; but the same is true of many symbolic models, and there seems no compelling reason for abandoning the latter in favor of the former. (For more on this point, see Gary Marcus.) --- Of course a cognitive model which cannot be implemented in real brains must be rejected; connecting neurobiology to cognition can hardly be too ardently desired. The point is that the elements in connectionist models called "neurons" bear only the sketchiest resemblance to the real thing, and neural nets are no more than caricatures of real neuronal circuits. Sometimes sketchy resemblances and caricatures are enough to help us learn, which is why Hebb, McCulloch and Neural Computation are important for both connectionism and neurobiology.

Reflections circa 2016

I first learned about neural networks as an undergraduate in the early 1990s, when, judging by the press, Geoff Hinton and his students were going to take over the world. (In "Introduction to Cognitive Science" at Berkeley, we trained a three-layer perceptron to classify fictional characters as "Sharks" or "Jets" using back-propagation; I had no idea what those labels meant because I'd never seen West Side Story.) I then lived through neural nets virtually disappearing from the proceedings of Neural Information Processing Systems, and felt myself very retro for including neural nets the first time I taught data mining in 2006. (I dropped them by 2009.) The recent revival, as "deep learning", is a bit weird for me, especially since none of the public rhetoric has changed. The most interesting thing scientifically about the new wave is that it's led to the discovery of adversarial examples, which I think we still don't understand very well at all. The most interesting thing meta-scientifically is how much the new wave of excitement about neural networks seems to be accompanied by forgetting earlier results, techniques, and baselines.

Reflections in early 2022

I would now actually say there are three scientifically interesting phenomena revealed by the current wave of interest in neural networks:

  1. Adversarial examples (as revealed by Szegedy et al.), and the converse phenomenon of extremely high confidence classification of nonsense images that have no humanly-perceptible resemblance to the class (e.g., Nguyen et al.);
  2. The ability to generalize to new instances by using humanly-irrelevant features like pixels at the edges of images (e.g., Carter et al.);
  3. The ability to generalize to new instances despite having the capacity to memorize random training data (e.g., Zhang et al.).

It's not at all clear how specific any of these are to neural networks. (See Belkin's wonderful "Fit without Fear" for a status report on our progress in understanding my item (3) using other models, going back all the way to margin-based understandings of boosting.) It's also not clear how they inter-relate. But they are all clearly extremely important phenomena in machine learning which we do not yet understand, and really, really ought to understand.

I'd add that I still think there has been a remarkable regression of understanding of the past of our field and some hard-won lessons. When I hear people conflating "attention" in neural networks with attention in animals, I start muttering about "wishful mnemonics", and "did Drew McDermott live and fight in vain?" Similarly, when I hear graduate students, and even young professors, explaining that Mikolov et al. 2013 invented the idea of representing words by embedding them in a vector space, with proximity in the space tracking patterns of co-occurrence, as though latent semantic indexing (for instance) didn't date from the 1980s, I get kind of indignant. (Maybe the new embedding methods are better for your particular application than Good Old Fashioned Principal Components, or even than kernelized PCA, but argue that, dammit.)

I am quite prepared to believe that part of my reaction here is sour grapes, since deep learning swept all before it right around the time I got tenure, and I am now too inflexible to really jump on the bandwagon.

That is my opinion; and it is further my opinion that you kids should get off of my lawn.

25 July 2022: In the unlikely event you want to read pages and pages of me on neural networks, try my lecture notes. (That URL might change in the future.)

26 September 2022: Things I should learn more about (an incomplete list):

  1. "Transformer" architectures, specifically looking at them as ways of doing sequential probability estimation. (Now [2023] with their own irritated notebook.)
    If someone were to throw large-language-model-sized computing resources at a Good Old Fashioned SCFG learner, and/or a , what kind of performance would one get on the usual benchmarks? Heck, what if one used a truly capacious implementation of Lempel-Ziv? (You'd have to back out the probabilities from the LZ code-lengths, but we know how to do that.) [See same notebook.]
    On that note: could one build a GPT-esque program using Lempel-Ziv as the underlying model? Conversely, can we understand transformers as basically doing some sort of source coding? (The latter question is almost certainly addressed in the literature.) [Ditto.]
  2. What's going on with diffusion models for images? (I know, that's really vague.)
    While I am proposing brutally stupid experiments: Take a big labeled image data set and do latent semantic analysis on the labels, i.e., PCA on those bags-of-words, and do PCA on the images themselves. Learn a linear mapping from the word embedding space to the image embedding space. Now take a text query/prompt, map it into the word embedding space (i.e., project on to the word PCs), map that to the image space, and generate an image (i.e., take the appropriate linear combination of image PCs). The result will probably be a bit fuzzy but there should be ways to make it prettier... Of course, after that we kernelize the linear steps (in all possible combinations).
  3. I do not understand how "self-supervised" learning is supposed to differ from what we always did in un-supervised learning with (e.g.) mixture models, or for that matter how statisticians have "trained" autoregressions since about 1900.
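To make the "back out the probabilities from the LZ code-lengths" idea in item (1) concrete, here is a deliberately crude sketch. Everything in it is illustrative rather than serious: it uses LZ78 incremental parsing, the rough c(log c + log A) bound for the code length of a parse into c phrases over an alphabet of size A, and it renormalizes the two-sided code-length differences so the conditional probabilities sum to one (raw LZ code lengths are not exactly prefix-consistent, so the renormalization is doing real work).

```python
import math

def lz78_phrases(s):
    """Parse string s into LZ78 phrases by incremental parsing."""
    seen = {""}
    phrases = []
    w = ""
    for ch in s:
        if w + ch in seen:
            w += ch          # keep extending the current phrase
        else:
            phrases.append(w + ch)
            seen.add(w + ch)
            w = ""
    if w:                    # count any leftover partial phrase
        phrases.append(w)
    return phrases

def lz78_codelength(s, alphabet_size):
    """Crude LZ78 code length in bits: c phrases, each coded by a
    back-pointer (~log2 c bits) plus one new symbol (~log2 A bits)."""
    c = len(lz78_phrases(s))
    if c == 0:
        return 0.0
    return c * (math.log2(c + 1) + math.log2(alphabet_size))

def lz_conditional(s, alphabet):
    """Back out P(next symbol | s) from code-length differences,
    via 2^(-delta), then renormalize over the alphabet."""
    weights = {a: 2.0 ** -(lz78_codelength(s + a, len(alphabet))
                           - lz78_codelength(s, len(alphabet)))
               for a in alphabet}
    z = sum(weights.values())
    return {a: w / z for a, w in weights.items()}

probs = lz_conditional("abababab", "ab")
```

A "truly capacious" version would of course need a vastly better compressor and a real evaluation, but this is the shape of the calculation.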
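The "brutally stupid" text-to-image experiment in item (2) is simple enough to spell out in a few lines of NumPy. The data here are random stand-ins (real images would be flattened pixel vectors and real captions would be bag-of-words counts), and the dimensions are arbitrary; the point is just the pipeline: PCA both modalities, fit a linear map between the two score spaces by least squares, then push a prompt through it and reconstruct.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for a labeled image data set:
# n images (flattened to pixel vectors) with bag-of-words captions.
n, n_pixels, n_words = 500, 64, 30
images = rng.normal(size=(n, n_pixels))
captions = rng.normal(size=(n, n_words))  # really counts; Gaussians for the sketch

def pca(X, k):
    """Return (mean, top-k principal components) via SVD."""
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:k]

k = 10
img_mu, img_pcs = pca(images, k)    # image embedding space
txt_mu, txt_pcs = pca(captions, k)  # word embedding space

# Scores of each data point in its own PC space.
Z_img = (images - img_mu) @ img_pcs.T
Z_txt = (captions - txt_mu) @ txt_pcs.T

# Linear map from word-score space to image-score space, by least squares.
W, *_ = np.linalg.lstsq(Z_txt, Z_img, rcond=None)

def generate(prompt_bow):
    """Project a prompt onto the word PCs, map to image PCs,
    and reconstruct an image as a linear combination of image PCs."""
    z_txt = (prompt_bow - txt_mu) @ txt_pcs.T
    return img_mu + (z_txt @ W) @ img_pcs

fuzzy_image = generate(captions[0])
```

Kernelizing the three linear steps, in all combinations, is then a matter of swapping in kernel PCA and kernel ridge regression.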
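As for item (3), the century-old "self-supervised" recipe can be stated in a few lines: an autoregression is trained by making the data supply its own labels, each value predicted from its own past. A minimal illustration, fitting an AR(1) by least squares on simulated data (the coefficient 0.8 and sample size are arbitrary choices for the sketch):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate an AR(1) process: x_t = 0.8 * x_{t-1} + noise.
T, phi = 2000, 0.8
x = np.zeros(T)
for t in range(1, T):
    x[t] = phi * x[t - 1] + rng.normal()

# "Self-supervised" training pairs: inputs are x_{t-1}, targets are x_t.
# The labels come from the data itself, as in fitting any autoregression.
X, y = x[:-1], x[1:]
phi_hat = (X @ y) / (X @ X)  # least-squares slope
```

The estimate `phi_hat` recovers something close to 0.8; nothing about the logic changes when the linear predictor is replaced by a network.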

Additional stray thought, recorded 27 May 2023: The loss landscape for a neural network, in terms of its weights, is usually very non-convex, so it's surprising that gradient descent (a.k.a. backpropagation) works so well. This leads me, unoriginally, to suspect that there is a lot of hidden structure in the optimization problem. Some of this is presumably just symmetries. But I do wonder if there isn't a way to reformulate it all as a convex program. (Though why gradient descent in the weights would then find it is a bit of a different question...) Alternately, maybe none of this is true and optimization is just radically easier than we thought; in that case I'd eat some crow, and be willing to embrace a lot more central planning in the future socialist commonwealth.

I presume there are scads of papers on all of these issues, so pointers are genuinely appreciated.
