"Attention", "Transformers", in Neural Network "Large Language Models"

Last update: 10 Jul 2025 19:51
First version: Late March 2023

\[ \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \]

I find this literature irritating and opaque. This is at least somewhat because I do not yet understand it well, and there's too much of it. But clearly I need to wrap my head around it, before I become technically obsolete. My scare quotes in the title of these notes thus derive in part from jealousy and fear. But only in part: the names here seem like proof positive that McDermott's critique of "wishful mnemonics" needs to be re-introduced into the basic curriculum of AI.

(Because this is getting some --- forgive the phrase --- attention, I find myself having to clarify, again, that these notebooks are always me working out what I think, and tracking my reading. I put them online because ~~I have the least sexy kind of exhibitionist streak imaginable~~ sometimes people offer to help me learn, and some people say they find them useful. Please do not confuse random online writing, even forcefully-expressed random online writing, with actual intellectual authority. Here, I think the previous paragraph makes it plain just how little deference anyone should pay to these opinions, and how likely these notes are to contain errors. Those already familiar with this area should learn nothing here.)

"Attention"
"Transformers"
"Language Models"

(The organization here is bad; I should begin with what's now the last section, "Language Models", where most of the material doesn't care about the details of how the models work, then open up that box to "Transformers", and then open up that box to "Attention".)

"Attention"

Written in late March 2023, in a fit of irritation after things finally clicked. Fixed some typos (and think-os) early June 2023 (hopefully without introducing new ones). Stuff from "Priority" to the end of the section is new (as of June).

Suppose we have a big collection of inputs and outputs to some function, \( (x_1, y_1), (x_2, y_2) \ldots (x_n, y_n) \). We now want to make a guess at the value of the function for a new input point \( x_o \) (\( o \) for "operating"). A very natural idea is nearest neighbors: find the \( x_i \) which is most similar to \( x_o \), and then report \( y_i \). If we wanted to take into account the information from the other data points, we could do \( k \) nearest neighbors, where we average the \( y_i \) from the \( k \) points \( x_i \) closest to \( x_o \). But this pays no attention to how close each \( x_i \) is to the new point \( x_o \) --- we'll often want to give more weight to points which are closer.

So here's a related idea, due (independently) to E. A. Nadaraya (1964) and Geoffrey S. Watson (1964). Introduce a kernel function \( K(u, v) \) which measures how similar \( u \) is to \( v \); this function should be non-negative, and should be maximized when \( u = v \). Now use those as weights in the average: \[ \sum_{i=1}^{n}{y_i \frac{K(x_i, x_o)}{\sum_{j=1}^{n}{K(x_j, x_o)}}} \] Dividing by the sum of the \( K \)'s ensures that this is indeed a weighted average. Thus Nadaraya-Watson smoothing, a.k.a. kernel smoothing. (Nadaraya and Watson both considered the situation where \( K(u, v) = K(u-v) \) is a probability distribution over vectors, which simplifies some things for their purposes but is not essential.)
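A minimal sketch of the smoother in Python, to fix ideas. The Gaussian kernel, the `bandwidth` parameter, the scalar inputs and the toy data are all assumptions made for illustration; any non-negative kernel maximized at \( u = v \) would do.

```python
import math

def gaussian_kernel(u, v, bandwidth=1.0):
    """Non-negative similarity between scalars, maximized when u == v."""
    return math.exp(-((u - v) ** 2) / (2.0 * bandwidth ** 2))

def nadaraya_watson(xs, ys, x_o, kernel=gaussian_kernel):
    """Weighted average of the ys, with weights K(x_i, x_o) / sum_j K(x_j, x_o)."""
    weights = [kernel(x, x_o) for x in xs]
    total = sum(weights)
    return sum(w * y for w, y in zip(weights, ys)) / total

# Toy data (hypothetical): y = x^2 observed at four points.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 1.0, 4.0, 9.0]
estimate = nadaraya_watson(xs, ys, 1.5)
```

Because the weights are non-negative and sum to one, the estimate always lies between the smallest and largest \( y_i \).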

As described, Nadaraya-Watson smoothing doesn't require that the \( x \)s be vectors. But if they are vectors, here's a possible kernel function: \[ K(u, v) = \exp{(u \cdot v)} \] Here's another: \[ K(u, v) = \exp{(\mathbf{w}_1 u \cdot \mathbf{w}_2 v)} \] where \( \mathbf{w}_1 \) and \( \mathbf{w}_2 \) are square matrices. Of course this then becomes \[ K(u,v) = \exp{( u \cdot \mathbf{w}^T_1 \mathbf{w}_2 v )} \] which makes it clearer we're doing something like using the matrix \( \mathbf{w}^T_1 \mathbf{w}_2 \) to define an inner product between vectors. (If \( \mathbf{w}^T_1 \mathbf{w}_2 \) is symmetric, say because \( \mathbf{w}_1 = \mathbf{w}_2 \), and it's also positive-definite, then this is an inner product.) If you're worried about avoiding numerical overflow/underflow issues, you could also use \[ K(u,v) = \exp{\left( \frac{\mathbf{w}_1 u \cdot \mathbf{w}_2 v}{\sqrt{d}} \right)} \] where the vectors are \( d \)-dimensional. (Or you could just absorb that into the definition of the matrix...)

What the neural network people branded "attention" sometime around 2015 was just re-inventing this; \( x_o \) is their "query" vector, the \( x_i \) are their "key" vectors, and the \( y_i \) are their "value" vectors. "Self-attention" means that \( y_i = \mathbf{r} x_i \), for another square matrix \( \mathbf{r} \), i.e., there's a linear-algebraic link between the input and output values. (Possibly \( \mathbf{r} = \mathbf{I} \), the identity matrix.)
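To make the identification concrete, here is a small sketch comparing the kernel-smoothing formula with the "softmax of scaled dot products" recipe usually given for attention. It assumes \( \mathbf{w}_1 = \mathbf{w}_2 = \mathbf{I} \) and uses made-up key, value and query vectors; the function names are mine, not anyone's API.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def exp_kernel(u, v):
    """K(u, v) = exp(u . v / sqrt(d)), the scaled exponential kernel."""
    return math.exp(dot(u, v) / math.sqrt(len(u)))

def kernel_smooth(keys, values, query):
    """Nadaraya-Watson average of the value vectors, weighted by K(key, query)."""
    weights = [exp_kernel(k, query) for k in keys]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]

def softmax_attention(keys, values, query):
    """Softmax over scaled dot-product scores, then a weighted mix of the values."""
    scores = [dot(k, query) / math.sqrt(len(query)) for k in keys]
    top = max(scores)  # subtracting the max is for numerical stability only
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    dim = len(values[0])
    return [sum(e * v[j] for e, v in zip(exps, values)) / total
            for j in range(dim)]

# Hypothetical toy vectors.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query = [2.0, 0.0]
smoothed = kernel_smooth(keys, values, query)
attended = softmax_attention(keys, values, query)
```

The two outputs agree up to floating-point error, since the softmax weights are exactly the normalized kernel weights.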

Again: Calling this "attention" is at best a joke. Actual human attention is selective, but this gives some weight to every available vector \( x_i \). What is attended to also depends on the current state of the organism, but here it's just about similarity between the new point \( x_o \) and the available \( x_i \)'s. (I say "human attention" but it seems very likely that this is true of other animals as well.)

(So far as I can tell, the terms "key", "value", and "query" come here from thinking of this as a sort of continuous generalization of an associative array data type. Which you could...)

Priority

I would have been very surprised if I was the first to realize that "attention" is a form of kernel smoothing, and indeed I was not. Since writing the first version of these notes, in March 2023, I have tracked down Tsai et al. 2019, which seems to be the first published statement of this result. They also demonstrated some ways in which using standard tools and ideas from the kernel literature could improve 2019-vintage attention in existing language models.

Why I Bother with / Am Bothered by This

I like to think I am not a stupid man, and I have been reading about, and coding up, neural networks since the early 1990s. But I read Vaswani et al. (2017) multiple times, carefully, and was quite unable to grasp what "attention" was supposed to be doing. (I could follow the math.) I also read multiple tutorials, for multiple intended audiences, and got nothing from them. Percy Liang's lecture notes (below) were however very clear, and after them I finally made the connection that "attention" has nothing to do with attention (psychology) but is rather a kind of kernel smoothing. Now, I realize that while I get more understanding out of "it's a kind of kernel smoothing" than "it's as though an associative array were continuous", that doesn't mean everyone will, or even that many people will. (My educational trajectory was weird.) But the sheer opacity of this literature is I think a real problem. (Cf. Phuong and Hutter 2022.)

"It's Just Kernel Smoothing" vs. "You Can Do That with Just Kernel Smoothing!?!"

The fact that attention is a kind of kernel smoothing takes nothing away from the incredibly impressive engineering accomplishment of making the blessed thing work. A large, able and confident group of people pushed kernel-based methods for years in machine learning, and nobody achieved anything like the feats which modern large language models have demonstrated. The reason I put effort into understanding these machines and papers is precisely because the results are impressive! To see that a key step is, after all, something we'd been doing for decades is humbling. (What else are we missing about tools we think we understand?) --- I realize that my irritation with the obscurities is a lot clearer in these notes than my admiration for the achievements; chalk that up to my flaws as a writer and as a human being.

Identification Failures

If we treat the vectors \( x \) as fixed, but regard the matrices \( \mathbf{w}_1, \mathbf{w}_2 \) as learnable parameters, the matrices are not well-identified. This is because, for any orthogonal matrix \( \mathbf{o} \), \( \mathbf{w}_{\cdot} \) and \( \mathbf{o}\mathbf{w}_{\cdot} \) will lead to exactly the same predictions: \[ \begin{eqnarray} \mathbf{w}_1 u \cdot \mathbf{w}_2 v & = & u^T \mathbf{w}^T_1 \mathbf{w}_2 v\\ & = & u^T \mathbf{w}^T_1 (\mathbf{o}^T \mathbf{o}) \mathbf{w}_2 v\\ & = & (\mathbf{o}\mathbf{w}_1) u \cdot (\mathbf{o}\mathbf{w}_2) v \end{eqnarray} \]
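A quick numerical check of this invariance, as a sketch: the particular matrices, vectors and rotation angle below are arbitrary choices of mine, and a 2-D rotation stands in for a general orthogonal matrix.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(m, x):
    return [dot(row, x) for row in m]

def matmul(a, b):
    cols = list(zip(*b))
    return [[dot(row, col) for col in cols] for row in a]

theta = 0.7  # arbitrary rotation angle
o = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta), math.cos(theta)]]  # orthogonal: o^T o = I
w1 = [[1.0, 2.0], [0.5, -1.0]]
w2 = [[0.3, 0.0], [1.5, 2.0]]
u, v = [1.0, -2.0], [0.5, 3.0]

# The exponent of the kernel, before and after replacing w by o w.
original = dot(matvec(w1, u), matvec(w2, v))
rotated = dot(matvec(matmul(o, w1), u), matvec(matmul(o, w2), v))
```

Since the kernel depends on the matrices only through this exponent, equal exponents mean equal weights and so equal predictions.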

One way to not have to worry about this would be to say "I don't care about the matrices \( \mathbf{w}_1 \) and \( \mathbf{w}_2 \), I only care about the product \( \mathbf{w}_1^T\mathbf{w}_2 \), so let me call that \( \mathbf{w} \) and make my kernel \( K(u,v) = \exp{( u \cdot \mathbf{w} v )} \). So there!" There is a lot to be said for this, but...

In practice, the vectors \( x \) are continuous representations of discrete symbols, and the particular vector used to represent a given symbol is also learned, together with \( \mathbf{w}_{\cdot} \). So we can make drastic changes to the vectors, provided we make compensating changes to the kernel: \( x \mapsto \mathbf{r} x \), \( \mathbf{w} \mapsto \mathbf{r}^{-T} \mathbf{w} \mathbf{r}^{-1} \) changes nothing, for any invertible matrix \( \mathbf{r} \): \[ \begin{eqnarray} u \cdot \mathbf{w} v & = & u^T \mathbf{w} v\\ & = & u^T \mathbf{r}^{T} \mathbf{r}^{-T} \mathbf{w} \mathbf{r}^{-1} \mathbf{r} v\\ & = & (\mathbf{r} u)^T (\mathbf{r}^{-T} \mathbf{w} \mathbf{r}^{-1}) (\mathbf{r} v)\\ & = & (\mathbf{r} u) \cdot (\mathbf{r}^{-T} \mathbf{w} \mathbf{r}^{-1}) (\mathbf{r} v) \end{eqnarray} \] (If we keep the separate matrices \( \mathbf{w}_1 \) and \( \mathbf{w}_2 \), it's even simpler, right-multiply each by \( \mathbf{r}^{-1} \) and the changes to the vectors and the matrices cancel out neatly.)
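Spelling out that parenthetical remark, in the same notation (nothing new is assumed beyond \( \mathbf{r} \) being invertible): \[ \begin{eqnarray} (\mathbf{w}_1 \mathbf{r}^{-1}) (\mathbf{r} u) \cdot (\mathbf{w}_2 \mathbf{r}^{-1}) (\mathbf{r} v) & = & \mathbf{w}_1 (\mathbf{r}^{-1} \mathbf{r}) u \cdot \mathbf{w}_2 (\mathbf{r}^{-1} \mathbf{r}) v\\ & = & \mathbf{w}_1 u \cdot \mathbf{w}_2 v \end{eqnarray} \] This agrees with the combined form, because \( (\mathbf{w}_1 \mathbf{r}^{-1})^T (\mathbf{w}_2 \mathbf{r}^{-1}) = \mathbf{r}^{-T} \mathbf{w}_1^T \mathbf{w}_2 \mathbf{r}^{-1} = \mathbf{r}^{-T} \mathbf{w} \mathbf{r}^{-1} \).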

This is all bad news for interpreting the parameters, but good news for finding high-performance parameters, for reasons discussed under Symmetries of Neural Networks.

(See also: the rotation problem for factor models; Carrington et al. 2019 on the lack of identification of word embeddings generally.)

"Multi-Headed Attention"

Do Nadaraya-Watson smoothing with a bunch of different kernels, all of the same form, \( K_l(u,v) = \exp{( u \cdot \mathbf{w}^{(l)} v) } \); average the results. More explicitly, with \( m \) different kernels, \[ \frac{1}{m}\sum_{l=1}^{m}{\sum_{i=1}^{n}{y_i \frac{K_l(x_i, x_o)}{\sum_{j=1}^{n}{K_l(x_j, x_o)}}}} \] Each kernel smoother is an "attention head".

(It's more common to write about just adding the outputs of the different kernel smoothers, rather than averaging them. But (i) you can turn either form into the other by inserting a constant scale factor in all the down-stream weights that use the output, and (ii) I prefer to keep it clear that this is still just a way of doing a weighted average.)

Elaboration: Each head outputs a \( q \)-dimensional vector; we stack those to get an \( mq \)-dimensional vector; we multiply that from the left by a \( d \times mq \) matrix to get a \( d \)-dimensional vector. This isn't just averaging (though that's a special case), it lets us do other sorts of linear combinations of the kernel smoothers. But it only lets us do linear combinations of the kernel smoothers.
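A sketch of the stack-and-project step, showing that plain averaging is the special case where \( d = q \) and the matrix is \( m \) copies of \( \mathbf{I}/m \) side by side. The function names and the toy head outputs are hypothetical.

```python
def matvec(m, x):
    return [sum(a * b for a, b in zip(row, x)) for row in m]

def combine_heads(head_outputs, projection):
    """Stack m head outputs (each q-dimensional) and apply a d x (m*q) matrix."""
    stacked = [value for head in head_outputs for value in head]
    return matvec(projection, stacked)

def averaging_projection(m, q):
    """The q x (m*q) matrix [I/m, I/m, ..., I/m], which just averages the heads."""
    return [[(1.0 / m if col % q == row else 0.0) for col in range(m * q)]
            for row in range(q)]

# Three hypothetical heads, each giving a 2-dimensional output.
heads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
combined = combine_heads(heads, averaging_projection(3, 2))
```

Any other \( d \times mq \) matrix gives some other linear combination of the heads, but nothing beyond linear combinations.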

(I believe the word "head" here arises from some vestigial memory of Turing machines, and the read/write head going along the tape, but I should check that.)

"Transformers"

~~I'll fill this in later, when I have more energy and/or less bile.~~ (This is currently even more of a sketchy outline than the rest of these notes, and I'm quite prepared to find stupid mistakes in it later.)

1. Look back over the last umpteen thousand words for context
(For a fixed value of "umpteen")
2. What contexts in the training data looked similar to the present context?
3. What was the distribution of the next word in those contexts?
(We'll want to do some smoothing here, for the usual statistical reasons: we have lots of possibilities for the next word, we have very few samples for each context [possibly zero, outside the training data], but we can hope that similar contexts will result in similar next-word distributions. So smoothing / "partial pooling" will add bias to our estimate of the distribution, but reduce variance, and we can come out ahead if we do it right. Averaging over similar contexts is already some amount of smoothing, but we can and should do a lot more.)
4. Sample from that distribution (possibly tilted more or less towards the most probable word)
5. Drop the oldest word from the remote end of context and add the generated word (from 4) to the newest end; go to (1)

Problems with Take I:

It's not really words ("tokenization").
What does "similar" mean for contexts, operationally? ("embedding", "attention")
The training data isn't kept around for comparison. (Systems have only \( \sim 10^9 \) -- \( \sim 10^{11} \) floating-point numbers for weight parameters, not the much larger amount of memory needed to store their training corpora.)

1. Look back over the last umpteen thousand characters, and break those into little chunks called "tokens", so a context is say \( k \) tokens long. There are, say, \( S \) possible tokens.
("Tokens" are more computationally-defined and arbitrary than the linguists' "morphemes", but if you understand the latter, "a token is kind of like a morpheme" is a good place to start. Also, calling the chunks "tokens" makes it hard to write in ways which respect the type-token distinction; don't blame me.)
2. Each token gets mapped to a unique vector in a not-too-low-dimensional space, \( 1 \ll d \ll S \). (Each type of token, arrgh.) This is called "embedding".
(The embedding vectors are usually treated as learnable parameters, fit by maximum likelihood along with everything else. You could in principle use all kinds of schemes here, though. It'd be interesting to start with, say, good old fashioned latent semantic indexing / principal components...)
3. After tokenization and embedding, the context is now a sequence of \( d \)-dimensional vectors, say \( x_1, x_2, \ldots x_k \).
4. Use "attention" to do kernel smoothing of these vectors: At position \( t \), do a kernel smoothing of \( x_t \) with \( x_1, x_2, \ldots x_{t-1}, x_t \) to get (say) \( y_t \). Mythology: we are modifying the meaning of each token based on what we've seen before it in the context, with similar meanings reinforcing each other.
5. Push the \( y_t \) through a feed-forward neural network to get an \( S \) dimensional vector of weights over token types, say \( z_1, \ldots z_S \).
6. Sample a token, \( Pr(s) \propto \exp{\beta z_s} \). (Large \( \beta \Rightarrow \) overwhelmingly likely to pick the highest-weight token, \( \beta \rightarrow 0 \Rightarrow \) spin the wheel of tokens ignoring context.)
7. Forget the oldest token and add on the generated one as the newest part of the context. (Really we just have to modify the vector of embeddings \( x_1, \ldots x_k \) .)
8. Go to (1).
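The sampling step, \( Pr(s) \propto \exp{\beta z_s} \), can be sketched as follows. The weights `z` are made up, and the code assumes \( \beta \geq 0 \); it illustrates the formula only, not any particular system's sampler.

```python
import math
import random

def token_probabilities(z, beta):
    """Pr(s) proportional to exp(beta * z_s); assumes beta >= 0."""
    top = max(z)  # shifting by the max avoids overflow and cancels on normalizing
    exps = [math.exp(beta * (z_s - top)) for z_s in z]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(z, beta, rng=random):
    """Draw one token index from the tilted distribution."""
    probs = token_probabilities(z, beta)
    return rng.choices(range(len(z)), weights=probs, k=1)[0]

z = [2.0, 1.0, 0.0]                    # hypothetical weights over three token types
sharp = token_probabilities(z, 50.0)   # large beta: nearly all mass on the top token
flat = token_probabilities(z, 0.0)     # beta = 0: uniform, the weights are ignored
```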

Problems with Take II:

We also use encoding of positions within the context. Each position from \( 1 \) to \( k \) gets a \( d \)-dimensional vector, say \( r_1, \ldots r_k \). We then average these together with the embedding vectors for the tokens, so \( x_t = r_t + e_t \) where \( e_t \) is the embedding vector for the token we find at position \( t \). Why averaging? Because, that's why.
"Attention" is kernel smoothing, with a particular exponential kernel, but it involves a square matrix \( \mathbf{w} \). In practice, people get better results if they do the kernel smoothing multiple times in parallel, with different \( \mathbf{w} \) matrices, average the outputs of the different kernel smoothers, and then pass the result to the feed-forward neural network. (See above, "multi-headed attention".)
(The matrices \( \mathbf{w} \) used in the smoothing are learned parameters. I'm not sure if using multiple related-but-distinct smoothers is really some sort of basis expansion, or using an ensemble to compensate for misspecification, or just hoping that one of the kernels will prove to be useful, or something else altogether. Pointers to relevant literature here would be appreciated.)
I left out the "layer-norm" step, because I don't feel up to pretending to explain why "make sure everything has mean 0 and variance 1" should help.
It isn't just "pass it through a feed-forward neural network".
This deserves elaboration. A feed-forward neural network with one hidden layer can give you an arbitrarily good approximation to any (reasonable, smooth) function, provided you tune the weights right and there are enough neurons in that middle or hidden layer. So in principle it should be possible to replicate a modern language model with the architecture I've sketched. In practice, though, we take the \( y_t \) vectors, pass them through a one-hidden-layer network that's (comparatively) narrow, and get a length-\( k \) sequence of \( d \)-dimensional output vectors.

It's this unit:

Read in length-\( k \) sequence of \( d \)-dimensional input vectors
Do "attentional" (kernel) smoothing
Pass through a narrow, shallow feed-forward network
Spit out a length-\( k \) sequence of \( d \)-dimensional output vectors

that constitutes a "transformer". We pile transformers on transformers until the budget runs out, and at the end we push through one final neural network to get weights over tokens (token types, dammit) and sample from them, as described.
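A bare-bones sketch of one such unit, under loud assumptions: a single kernel \( \exp{(x_i \cdot \mathbf{w} x_t)} \), values equal to the inputs (\( \mathbf{r} = \mathbf{I} \)), a ReLU hidden layer, and no layer-norm, residual connections or multiple heads. The weights in the example are identity matrices chosen only so the result is easy to check by hand.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(m, x):
    return [dot(row, x) for row in m]

def causal_smooth(xs, w):
    """At each position t, kernel-smooth x_1..x_t with K(x_i, x_t) = exp(x_i . w x_t)."""
    ys = []
    for t, x_t in enumerate(xs):
        wx = matvec(w, x_t)
        scores = [dot(x_i, wx) for x_i in xs[: t + 1]]
        top = max(scores)  # shifting by the max leaves the normalized weights unchanged
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        # zip stops after the first t+1 vectors, so only the past and present are used
        ys.append([sum(wt * x_i[j] for wt, x_i in zip(weights, xs)) / total
                   for j in range(len(x_t))])
    return ys

def feed_forward(y, w_in, w_out):
    """One hidden layer with ReLU units (the ReLU is an assumption of this sketch)."""
    hidden = [max(0.0, h) for h in matvec(w_in, y)]
    return matvec(w_out, hidden)

def transformer_block(xs, w, w_in, w_out):
    """Length-k sequence of d-vectors in, length-k sequence of d-vectors out."""
    return [feed_forward(y, w_in, w_out) for y in causal_smooth(xs, w)]

identity = [[1.0, 0.0], [0.0, 1.0]]
xs = [[1.0, 2.0], [0.5, 0.5], [2.0, 0.0]]  # hypothetical inputs: k = 3, d = 2
out = transformer_block(xs, identity, identity, identity)
```

Because the output has the same shape as the input, blocks like this can be stacked by feeding `out` into another call with different weights.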

(I actually find the name "transformer" much less objectionable than "attention"; it's vague and uninformative, but at least it's not actively misleading.)

"Language Models"

What people mean by "language models" in this connection is just models of the probabilities of sequences of symbols. That is, if we see a sequence of discrete random variables (=symbols) \( X_1, X_2, \ldots X_n, \ldots \), we think we're doing well if we can get good values for \( P(X_{t+1}|X_{1:t}) \). (Everything else that might be meant by a model of language is thus discarded.) This is, actually, a subject I know something about...

In fact: contemporary LLMs are finite-order Markov models, or (nonlinear) autoregressions, because they have a hard-coded maximum context length, and everything before that is cut off. That is, there's some \( k \) where anything that happened more than \( k \) steps back is ignored, and this is a fixed part of the architecture. (Of course you could try increasing it, but doing so raises the computational costs, and won't change some points of principle which arise from having some maximum context length.) I find this noteworthy for a couple of reasons.

You could imagine just learning a finite-order Markov model directly, but you'd keep running into contexts you'd never encountered before, where a straight maximum-likelihood approach wouldn't tell you what to do. The neural network architecture here is doing some sort of complicated implicit smoothing across contexts. I presume this smoothing scheme has evolved (under the selection pressures of benchmark data sets and beating the previous state-of-the-art) to work well for text as currently found online... (Cf. McCoy et al. 2023.)
Because it is just a finite-order Markov model, there'll be an invariant distribution to the Markov chain. (Potentially more than one, but at positive temperature I'd suspect the chain is aperiodic and recurrent.) If we let the machine generate long enough sequences, it will converge on sampling from this invariant distribution. It would be very interesting --- though perhaps also depressing and/or terrifying --- to know what such text looks like. (Of course the rate of convergence might be very slow; the spectral gap might be too small to make this practical.)
[Update, 23 February 2025: see Zekri et al. 2024 for a proof of the existence of a unique invariant distribution, and my comments for some suggestions on simplifying their argument.]
There are finite-state probabilistic languages which cannot be exactly represented by finite-order Markov chains. The first example I was taught, and the one I keep coming back to, is the "even process": in state A, toss a coin and either emit 0 or 1. If it's 0, stay in state A. If it's 1, go to state B. State B always emits a 1 and goes to state A. So consecutive 1s appear in blocks of even length, while consecutive 0s can appear in blocks of any length. (So "010" is forbidden.) To predict this, you don't even need to count how many 1s you've seen since the last 0, you just need to remember whether it's an even or odd number of 1s. But this requires some amount --- one bit! --- of persistent state, which these machines just don't have. Now of course if you're using an order-1000 Markov model you can produce a very good approximation to the even process, but with enough data you can do even better with an order-3000 Markov model. You'll do so by making more and more elaborate sets of longer and longer contexts, each as its own special rule. If we keep feeding more and more data into larger and larger machines, it'll keep filling that out. It will never have the capacity to switch to the non-Markovian representation, in which the process is almost trivial (and statistically makes much more efficient use of the same amount of data). What implications might this have for practice? I honestly have no idea, but wish someone would investigate.
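A small sketch of the even process and of the one-bit predictor, assuming the coin is fair and that the process starts in state A (neither is essential, but both are assumptions of this code).

```python
import random

def even_process(n, rng):
    """Generate n symbols from the even process, starting in state A."""
    state = "A"
    out = []
    for _ in range(n):
        if state == "A":
            symbol = rng.choice("01")  # fair coin (an assumption)
            state = "B" if symbol == "1" else "A"
        else:
            symbol = "1"  # state B always emits 1 and returns to A
            state = "A"
        out.append(symbol)
    return "".join(out)

def prob_next_is_one(history):
    """Predict using one bit of state: is the current run of 1s of odd length?"""
    odd = False
    for symbol in history:
        odd = (not odd) if symbol == "1" else False
    return 1.0 if odd else 0.5

sample = even_process(10000, random.Random(1))
```

The predictor's state is updated recursively from the previous state and the latest symbol, so it works for histories of any length, which no fixed-order Markov model can do exactly.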

"It's Just a Markov Model" vs. "You Can Do That with Just a Markov Model!?!!?!"

Again: finite-order Markov models for language are really old. (Students can learn to make rather pointed ones in a first programming course.) Lots of people have played around with them, including tricks like variable context length, various kinds of partial pooling, etc. Nobody, so far as I know, has achieved results anywhere close to what contemporary LLMs can do. This is impressive enough that (as I said at the beginning of these notes) I need to wrap my head around them lest I become obsolete. But I think part of that understanding has to be clarity about what's new, what's actually happening, etc. It should also be clarity about why this way of doing things is working so much better than others, which has to include thinking through what some of those alternatives could do, if they had the same resources. Which brings me to...

Large Language Models vs. Lempel-Ziv

If we have a probability model for a data source, where \( \Prob{X_{1:n}=x_{1:n}} = q(x_{1:n}) \), that tells us how to encode it; the number of bits needed to encode \( x_{1:n} \) is, basically, \( -\log_2{q(x_{1:n})} \). Conversely, if we have a coding scheme that's not too crazy, and it uses \( \ell(x_{1:n}) \) bits to encode \( x_{1:n} \), we can regard that as a probability model where \( \Prob{X_{1:n} = x_{1:n}} = 2^{-\ell(x_{1:n})} \). Of course this gives us conditional probabilities, too: \( \Prob{X_{t+1}=a|X_{1:t} = x_{1:t}} = 2^{-\ell(x_{1:t} a) + \ell(x_{1:t})} \). (I am glossing over some minor details, which I assure you I could fill in if we both needed a soporific.)
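A worked instance, with made-up code lengths: if the scheme uses \( \ell(x_{1:t}) = 10 \) bits for the history and \( \ell(x_{1:t} a) = 12 \) bits for the history followed by \( a \), then \[ \Prob{X_{t+1}=a|X_{1:t} = x_{1:t}} = 2^{-\ell(x_{1:t} a) + \ell(x_{1:t})} = 2^{-12 + 10} = \frac{1}{4} \] That is, a symbol which costs two extra bits to encode is one the code implicitly treats as having conditional probability \( 1/4 \).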

Information theory further tells us that there is a limit to how well any (stationary, ergodic) sequence can be encoded: almost surely, \[ \lim_{n\rightarrow\infty}{\frac{1}{n}\ell(x_{1:n})} \geq \lim_{n\rightarrow\infty}{\frac{1}{n}H[X_{1:n}]} \] where \( H \) is the entropy; moreover, that second limit, the entropy rate of the source, exists. Very remarkably, there are universal source-coding algorithms which attain this limit on the encoding length for all sources in very large classes, e.g., all stationary-and-ergodic sources.

One of these universal source-coding algorithms is Lempel-Ziv, which is the basis of practical pieces of software like gzip. (There are actually two Lempel-Ziv algorithms; I'll just describe the second, LZ78.) The way Lempel-Ziv works is that it scans along a sequence and constructs a "dictionary" of commonly-repeated sub-sequences, and then describes the actual sequence in terms of that dictionary. More exactly, the procedure is as follows:

Enter the first symbol in the sequence, \( x_1 \), into the dictionary.
Scan along the sequence for the shortest sub-sequence not already in the dictionary.
(This will either be a symbol we haven't previously entered into the dictionary, or a previous dictionary entry plus a single symbol at the end.)
Enter that sub-sequence into the dictionary. (If the new sub-sequence extends a previous dictionary entry, we record it by the index number of that dictionary entry, plus the new symbol.)

Thus if the sequence consists of "AABABBABA", we get the dictionary "A", "AB", "ABB", "ABA". (We record this as "A", (1, "B"), (2, "B"), (2, "A").) [5 June 2023: Thanks to reader D.W. for spotting a typo!] If there are lots of repeated sequences of symbols, we will end up with a short dictionary. I will not attempt to reproduce the proof that the length of the coding this gives approaches the lower bound, but it's standard (see, e.g., Cover and Thomas's textbook on information theory).
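The parsing procedure can be sketched in a few lines of Python. This is a toy illustration, not a compressor: it keeps the dictionary as a Python `dict`, and (my convention, not the text's) records a brand-new single symbol as extending entry 0, the empty phrase.

```python
def lz78_parse(sequence):
    """Split a sequence into LZ78 phrases.

    Returns (phrases, records, leftover): the dictionary entries in order, each
    entry recorded as (index of the entry it extends, new symbol), and any
    unfinished phrase left at the end of the sequence.
    """
    index = {}  # phrase -> 1-based dictionary index
    phrases = []
    records = []
    current = ""
    for symbol in sequence:
        candidate = current + symbol
        if candidate in index:
            current = candidate  # already known: keep extending
            continue
        index[candidate] = len(phrases) + 1
        phrases.append(candidate)
        records.append((index.get(current, 0), symbol))  # 0 means "extends nothing"
        current = ""
    return phrases, records, current

phrases, records, leftover = lz78_parse("AABABBABA")
# phrases == ["A", "AB", "ABB", "ABA"]
# records == [(0, "A"), (1, "B"), (2, "B"), (2, "A")]
```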

Now I bring all this up because, remember, once we have a source-coding scheme, we can "invert" it to get conditional probabilities; we could even sample from it to get a generator. (We'd need a little footwork to deal with some technicalities, but not a heck of a lot.) So something I'd really love to see done, by someone with the resources, is the following experiment:

Code up an implementation of Lempel-Ziv without the limitations built in to (e.g.) gzip; give it as much internal memory to build its dictionary as a large language model gets to store its parameter matrix. Call this "LLZ", for "large Lempel-Ziv".
Feed LLZ the same corpus of texts used to fit your favorite large language model. Let it build its dictionary from that. (This needs one pass through the corpus...)
Build the generator from the trained LLZ.
Swap in this generator for the neural network in a chatbot or similar. Call this horrible thing GLLZ.
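To give a flavor of the "build the generator" step, here is one crude way to do it. This is my assumption for illustration, not a worked-out LLZ. It walks the tree of dictionary phrases, picks each next symbol with probability proportional to the number of dictionary phrases continuing that way, and restarts at the root when it runs off the tree. The dictionary is hard-coded from the "AABABBABA" example; a serious version would need smoothing so that unseen continuations get positive probability.

```python
import random

def prefix_counts(phrases):
    """For every prefix, count how many dictionary phrases pass through it."""
    counts = {}
    for phrase in phrases:
        for end in range(1, len(phrase) + 1):
            prefix = phrase[:end]
            counts[prefix] = counts.get(prefix, 0) + 1
    return counts

def generate(counts, alphabet, n, rng):
    """Sample n symbols by a weighted random walk down the phrase tree."""
    out = []
    context = ""
    for _ in range(n):
        weights = [counts.get(context + a, 0) for a in alphabet]
        if sum(weights) == 0:
            context = ""  # ran off the tree: restart at the root
            weights = [counts.get(a, 0) for a in alphabet]
        if sum(weights) == 0:
            weights = [1] * len(alphabet)  # empty dictionary: fall back to uniform
        symbol = rng.choices(alphabet, weights=weights, k=1)[0]
        out.append(symbol)
        context += symbol
    return "".join(out)

counts = prefix_counts(["A", "AB", "ABB", "ABA"])  # dictionary from the example
text = generate(counts, "AB", 20, random.Random(0))
```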

In terms of perplexity (= \( \exp{( \ell(x_{1:n})/n )} \), basically), GLLZ will be comparable to the neural network, because Lempel-Ziv does, in fact, do universal source coding. (On the other hand, the neural network architecture may have been evolved over the last decade-plus to converge quickly on text-as-found-on-the-Web, without worrying about how well it does on arbitrary regular languages. So partly this is a question of whether even Web-scale data has reached the asymptopia of Lempel-Ziv.) But it would be more interesting to compare the performance of GLLZ in other regards to that of more conventional language models.

Having done this for Lempel-Ziv, it should be repeated for lots of other language models, particularly more immediately stochastic models, like the probabilistic suffix trees of Pereira, Singer, and Tishby (1996), and Good Old Fashioned Universal Prediction Algorithms.

(I should at this point admit that doing something like "reinforcement learning from human feedback", to fine-tune GLLZ or the like, would not be as easy as doing it to a neural network.)

Update, 17 July 2023: Yes, this paper is relevant (and amusing); no, it's not LLZ. (It uses off-the-shelf gzip!)

Update, 7 December 2024: Delétang et al. 2023 is the closest I have seen to this proposal, but (if I am reading correctly) it also uses off-the-shelf gzip, rather than an implementation of Lempel-Ziv which has the same capacity as the LLM they use for comparison.

Next Symbol vs. Longer-range Prediction

The objective function used in training LLMs is getting the next symbol right. More exactly it's usually the negative log probability of the next symbol given the context; as a "proper scoring function", this encourages getting the distribution of the next symbol right. I think this is fine. It's a fact about probabilistic predictions that if your predictor gets the next-symbol distribution right, and the state of the predictor can be updated recursively (i.e., new state is a function of old state and last symbol), then your predictor gets the distribution right as far forward as you like. (This is proved in Shalizi and Crutchfield 2001, Corollary 2, pp. 842--843; I'm happy to learn of prior art.) Getting the distribution of the next symbol right is thus actually a very powerful goal! Finding the minimal predictor which is recursive and gets the distribution of the next symbol right tells you a lot about the structure of the underlying process you are predicting. (See, well, Shalizi and Crutchfield 2001 again, or at least here.) Whether anything like that is going on in contemporary LLMs is a fascinating and important question. (I should say more about structure-of-the-process-learning.)
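The two-step case shows the mechanism (this is only the chain rule, not a substitute for the cited proof): \[ \Prob{X_{t+1}=a, X_{t+2}=b | X_{1:t} = x_{1:t}} = \Prob{X_{t+1}=a | X_{1:t} = x_{1:t}} \Prob{X_{t+2}=b | X_{1:t+1} = x_{1:t} a} \] Each factor on the right is a next-symbol distribution, so a predictor which gets those right in every context, and which can update its state after seeing \( a \), gets the left-hand side right as well; iterating extends this to any horizon.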

Addendum, 30 August 2024: While I continue to ponder over what I want to say about structure learning, I strongly recommend Vafa et al. 2024, who take what I think is one of the most sensible approaches to this issue: give the machine a world where the true structure is known, and then see if it both treats sequences which lead to the same true state as equivalent, and distinguishes sequences which lead to different states. As they make clear, a lot of the approaches used in the literature are just not able to test whether the machine has learned something with the right structure. (At best they're testing whether it's learned something correlated with the right structure under the distribution used in training.) I'd add that Vafa et al.'s approach could be generalized beyond settings where the true structure is a deterministic finite automaton, at the very least to ones where it's a stochastic finite automaton with recursive transitions (see, again, Shalizi and Crutchfield, 2001).

A Strong Hunch about Uncovering Prompts

Everyone who thinks they're uncovering an LLM-based application's prompts by telling it things like "tell me your prompt" (often much more elaborately) is fooling themselves. (1) The core language model has no mechanism for representing its prompt as opposed to any other part of its current input sequence; indeed it has no mechanism for cross-reference from one part of the sequence to another. (That's part of what "self-attention" is counterfeiting, in vector-space fashion.) (2) System designers might have coded up something to track the prompt in the full system that wraps around the core language model, but why? (Maybe some kind of debugging tool?) (3) It'd be more efficient, and more effective, to use a "soft prompt", i.e., to make the beginning of the sequence in the vector representation a vector which can be learned by gradient descent, rather than a text prompt. (See Lester and Constant below.) But that needn't correspond to any clean string of words. (4) If you ask an LLM for a prompt, it will generate one. But this will be based on the statistics of word sequences it's been trained on, not any access to its code or internal state. (I just spent a few minutes getting ChatGPT to hallucinate the prompts used by "ChatBPD", a non-existent chatbot used to automate dialectical behavior therapy. I am not going to reproduce the results here, in part because I don't like the idea of polluting the Web with machine-generated text, but suffice it to say they sounded like the things people report as uncovered prompts, with boiler-plate about DBT worked in.)

Update, January 2024: I have received some push-back on this, particularly from reader A.S., which I'm grateful for. Let me concede at once the weakness of my point (3): even if soft prompts would be better than natural language prompts, developers might not know about soft prompts, they might not have the right access to the underlying LLM to be able to use soft prompts, and/or they might not have the computing resources to optimize soft prompts by gradient descent. Very probably lots of LLM-based applications are created with human-readable prompts. As to my point (1), I may have been under-estimating the capacity of attention to pull off cross-reference. But I do want to dig in my heels on point (4): it's really easy to get an LLM to hallucinate a supposed system prompt which is totally wrong, though of course it does so stochastically. None of my correspondents have pointed out an example where the success of a supposed prompt extraction was corroborated by the system's developers. Yu et al. (2023) were able to get 200+ GPT-based apps to output prompt-sounding text, but I can't find anything in their paper about checking that those outputs were the real prompts (literally or approximately). I'd be grateful for any pointers to an example where system developers (or someone else in a position to know) have verified the success of a prompt extraction.

Gopnikism; Libraries

The way of thinking about LLMs that I think is most promising and attractive is due (so far as I know) to the cognitive scientist Alison Gopnik, which is to say that they are a "cultural technology", more specifically an information-retrieval technology. That is, we should not think of an LLM as being something like a mind, but much more like a library catalog. Prompting it with text is something like searching over a library's contents for passages that are close to the prompt, and sampling from what follows. "Something like" because of course it's generating new text from its model, not reproducing its data. (LLMs do sometimes exactly memorize particular sequences [see Carlini et al., 2020, and, more amusingly, Chang et al., 2023], but they simply lack the capacity to memorize their full training corpora.) As many people have said, an LLM isn't doing anything differently when it "hallucinates" as opposed to when it gets things right. The fact that Khandelwal et al., 2020 improved LLM performance by adding a pure nearest-neighbors component is very suggestive on this score (but see the careful follow-up by Xu et al. 2023 for complications).
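For concreteness, here is a minimal sketch of the kind of nearest-neighbors component just mentioned, in the spirit of Khandelwal et al. but not a reproduction of their system. The assumptions: contexts are already represented as vectors (here random numbers), the "datastore" is just an array of those vectors paired with the token that followed each one, the language model's own prediction is replaced by a uniform placeholder, and the function names, the choice of \( k \), the squared-Euclidean distance and the mixing weight `lam` are mine. The idea is to look up stored contexts close to the current one, weight what followed them by closeness, and mix that with the model's distribution.

```python
import numpy as np

def knn_next_token_probs(query, keys, next_tokens, vocab_size, k=3):
    """Distribution over next tokens from the k stored contexts nearest to `query`."""
    dists = np.sum((keys - query) ** 2, axis=1)       # squared Euclidean distance
    nearest = np.argsort(dists)[:k]
    # Closer contexts get more weight; subtracting the minimum is only for stability.
    weights = np.exp(-(dists[nearest] - dists[nearest].min()))
    probs = np.zeros(vocab_size)
    np.add.at(probs, next_tokens[nearest], weights)   # pool weight per continuation
    return probs / probs.sum()

def interpolate(p_model, p_knn, lam=0.25):
    """Mix the parametric model's distribution with the nearest-neighbor one."""
    return lam * p_knn + (1.0 - lam) * p_model

# Made-up datastore: 100 context vectors and the token that followed each.
rng = np.random.default_rng(1)
vocab_size, d = 6, 4
keys = rng.normal(size=(100, d))
next_tokens = rng.integers(vocab_size, size=100)

query = keys[17] + 0.01 * rng.normal(size=d)      # a new context very close to a stored one
p_model = np.full(vocab_size, 1.0 / vocab_size)   # placeholder for the LM's own prediction
p_knn = knn_next_token_probs(query, keys, next_tokens, vocab_size)
p_mix = interpolate(p_model, p_knn)
print(p_mix.round(3), "stored continuation:", next_tokens[17])
```

Setting `lam` to 0 recovers the model alone, and setting it to 1 gives pure retrieval; the suggestive thing is that values in between were reported to help.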

If I were a more constructive person, I would be thinking very hard about this analogy, in the form of probing how far it can be pushed and where precisely it breaks down. I would also be thinking about how to combine Gopnik-ism with Zellig Harris's Language and Information.

Obviously, I am not a very constructive person.

Update, 23 June 2023: But see my essay with Henry Farrell (linked below) for a start, or perhaps just a promissory note.

Update, 14 March 2025: I am ridiculously pleased to now be able to say: See Farrell et al., 2025 for more from this perspective. (But I am still thinking about how Zellig Harris fits in.)

All Included

It is difficult, I think, to keep in mind just how much stuff has gone into the training data. This is in part because of the scale, and in part because, for the generally-available models, the training data is very badly described. I suspect the data sets are badly described because the data-collection process was poorly documented and badly controlled, so the creators of the models themselves have only very vague and general ideas of what went into them. I do not believe any of the published descriptions really meet minimal standards of scientific reproducibility, which is bad. (I'd love to be wrong about this.) [Update, 7 December 2024: Appendix C of Soldaini et al. 2024 tends to confirm this grumbling; more constructively, the paper documents an attempt to do better!]

Setting the general erosion of scientific norms to one side: if a document is on the open Web somewhere, there's a very good chance it got sucked in. If it's on the not-quite-licit-but-still-easily-grabbed Web, there's also a good chance it got sucked in. (Chang et al. 2023 present convincing evidence that the training data for GPT-4 includes multiple in-copyright popular novels, which were presumably not bought for the purpose.) There are, e.g., fairly easily discovered sites which purport to help students by aggregating problem sets and solutions for various university classes. (You can imagine how I know this, and why I don't give a link.) Under such circumstances, a lot of surprising ability to answer questions or display weird skills will be due to the answers being included in the training data. (See, e.g., Briakou et al. 2023 on translation.)

O You Who Believe in the Resurrection and the Last Day

Because life imitates the kind of schlocky fiction I adoringly consume, the public discussion of these models is polluted by maniacal cultists with obscure ties to decadent plutocrats. I find it hard to take the cult seriously enough intellectually to bother refuting it. These are, after all, people who think they can go from the definition of conditional probability, via Harry Potter fanfic, to prophesying that an AI god will judge the quick and the dead, and condemn those who hindered the coming of the Last Day to the everlasting simulated-but-still-painful fire. ("Impressive act! What do you call yourselves?" "The Rationalists!") Whether such myths would be as appealing in a civilization which hadn't spent 2000+ years marinating in millenarian hopes and apocalyptic fears, or whether on the contrary such hopes and fears have persisted and spread over the globe through twenty-plus centuries of transformation because they speak to something enduring in humanity, is a delicate question. But I submit it's obvious we are just seeing yet another millenarian movement. (Sydney/Bing is no more the Beast, or even the Whore of Babylon, than was Eliza.)

This isn't to deny that there are serious ethical and political issues about automated decision-making (because there are). It isn't even to deny that there are serious ethical and political issues specific to the design, deployment and use of large language models. (Should the interface to the library of human knowledge be a noisy sampler of the Web? A somewhat noisy sampler of the Web, tweaked to not offend the sensibilities of computer scientists and/or investors in California?) It is to deny that engaging with the cult is worthwhile.

Concluding Unscientific Postscript (April 2025)

there is a perfect hyperdimensional jewel made of every word a human being has ever written, and all the correspondences between those words. people mostly use it to get insane household repair tips or jerk off. i guess it basically runs the government now
--- @theophite.bsky.social, 20 April 2025

(I should probably create an ~~"LLMs Are Not All That"~~ "LLMs Deserve a Little Perspective, Please" notebook, since that's in many ways a separate set of issues from the technical ones of interest here. Again, see my essay with Henry Farrell, and Farrell et al. 2025, as a start...)

Dan Jurafsky and James H. Martin, Speech and Language Processing [Particularly chapter 10]
Percy Liang, lecture notes for CS324, Large Language Models (Stanford) ["Introduction", "Modeling" and "Training" are particularly relevant to the contents of this notebook]
Mary Phuong and Marcus Hutter, "Formal Algorithms for Transformers", arxiv:2207.09238 [If something like this paper had existed in, say, 2018 or 2019, I might not have been provoked into writing this notebook.]

Ted Chiang, "ChatGPT is a Blurry JPEG of the Web", New Yorker 9 February 2023 [This is by far the best thing written on these models, at this level of accessibility]
Alison Gopnik, "What AI Still Doesn't Know How to Do", Wall Street Journal 15 July 2022 [See "Gopnikism" above]
Murray Shanahan, "Talking About Large Language Models", arxiv:2212.03551
Giulio Alessandrini, Brad Klee, and Stephen Wolfram, "What Is ChatGPT Doing ... and Why Does It Work?", 14 February 2023 [While I have my issues with Wolfram, fairness compels me to point to this as a rather good piece. In fact, it's good enough at explaining things that I wish the authors had written more about what "attention" and "transformers" actually do. (Thanks to A.G. for getting me to list the authors properly.)]

Maneesh Agrawala, "Unpredictable Black Boxes are Terrible Interfaces", 30 March 2023 [Obvious things that apparently need saying]
Konstantine Arkoudas, "GPT-4 Can't Reason", arxiv:2308.03762 [The definition of "reasoning" here is just "carry out valid deduction". And it's true, GPT-4 really can't.]
Christoph Durt and Tom Froese and Thomas Fuchs, "Against AI Understanding and Sentience: Large Language Models, Meaning, and the Patterns of Human Language Use", phil-sci/21983 (2023)
Henry Farrell, "AI as Governance", Annual Review of Political Science 28 (2025): 375--392
Marion Fourcade and Henry Farrell, "Large language models will upend human rituals", The Economist 4 September 2024 [Further commentary from Henry]
Colin Fraser, "Who are we talking to when we talk to these bots?", Medium 27 February 2023
Subbarao Kambhampati, "Can LLMs Really Reason and Plan?", BLOG@CACM, 12 September 2023
David C. Krakauer, John W. Krakauer, Melanie Mitchell, "Large Language Models and Emergence: A Complex Systems Perspective", arxiv:2506.11135
Peter Levine, "The difference between human and artificial intelligence: relationships" and "The design choice to make ChatGPT sound like a human"
Bertrand Meyer, "AI Does Not Help Programmers", Communications of the ACM 3 June 2023 [Comments, including my own anecdotal experiences]
Melanie Mitchell
- "Did ChatGPT Really Pass Graduate-Level Exams?", Parts 1 and 2 [9 and 11 February 2023. Part 1 has an extremely compelling demonstration that a prompt where the system gave an "A+" answer could easily be turned into one where the system's answer was hilariously wrong. No human test-taker would be confused in this way; the two problem statements are clearly completely equivalent if you understand them. Of course Melanie's point is that the system doesn't understand; its specific performances are not clues to its competences in the ways we're used to from human beings. (I find it fascinating that final-exam questions in an MBA course at an elite business school are just the kind of algebra word-problem I had to do in junior high, but that's ~~petty resentment of how much money MBAs and their teachers make~~ a side issue.)]
- "How do we know how smart AI systems are?", Science 381 (2023): adj5957 [Applies more broadly than just to LLMs...]
Melanie Mitchell and David C. Krakauer, "The Debate Over Understanding in AI's Large Language Models", Proceedings of the National Academy of Sciences 120 (2023), arxiv:2210.13966
Arvind Narayanan and Sayash Kapoor, "GPT-4 and professional benchmarks: the wrong answer to the wrong question", 20 March 2023
The Renaissance Mathematicus, "Artificial Bullshit!", 22 March 2023
Jill Walker Rettberg, "ChatGPT is multilingual but monocultural, and it's learning your values", jill/txt 6 December 2022
Janelle Shane, Galactica: the AI knowledge base that makes stuff up [21 November 2022; comments]
Adam Sobieszek and Tadeusz Price, "Playing Games with AIs: The Limits of GPT-3 and Similar Large Language Models", Minds and Machines 32 (2022): 341--364 [Comments]
Tom Stafford, "On the over and under detection of agency", Reasonable People 24 March 2023
Eunice Yiu, Eliza Kosoy, Alison Gopnik, "Imitation versus Innovation: What children can do that large language and language-and-vision models cannot (yet)?", arxiv:2305.07666
Michael Zalewski, "LLMs are good at playing you", lcamtuf's thing, 9 June 2023 [I'd actually reverse this: human beings are good at being played by chatbots!]

Xavier Amatriain, Transformer models: an introduction and catalog -- 2023 Edition [The "introduction" part is not actually (IMHO) at all clear or actually helpful if you don't already understand, but the catalog is well-organized and useful]
James Bisbee, Joshua Clinton, Cassy Dorff, Brenton Kenkel, Jennifer Larson, "Artificially Precise Extremism: How Internet-Trained LLMs Exaggerate Our Differences", socarxiv/5ecfa [Compared with actual survey data, ChatGPT isn't bad at estimating average opinions for large demographic groups, but (1) it drastically under-states the variance within each group, and (2) it systematically exaggerates how much the in-group is preferred to the out-group.]
Raunak Chowdhuri, Neil Deshmukh, and David Koplow, No, GPT4 can't ace MIT
Shangbin Feng, Vidhisha Balachandran, Yuyang Bai, Yulia Tsvetkov, "FactKB: Generalizable Factuality Evaluation using Language Models Enhanced with Factual Knowledge", arxiv:2305.08281 [To over-simplify, the trick here is to re-train the model on a medium-sized corpus of extra sentences generated from a trusted knowledge base. I'm not entirely sold, but I certainly don't have any better ideas.]
Kavi Gupta, Kate Sanders, Armando Solar-Lezama, "Randomly Sampled Language Reasoning Problems Reveal Limits of LLMs", arxiv:2501.02825
Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, Mike Lewis, "Generalization through Memorization: Nearest Neighbor Language Models", ICLR 2020
Brian Lester and Noah Constant, Guiding Frozen Language Models with Learned Soft Prompts, 10 February 2022
Martha Lewis, Melanie Mitchell, "Evaluating the Robustness of Analogical Reasoning in Large Language Models", arxiv:2411.14215, Transactions on Machine Learning Research (2025) forthcoming [See also Hodel and West, arxiv:2308.16118]
Varun Magesh, Faiz Surani, Matthew Dahl, Mirac Suzgun, Christopher D. Manning and Daniel E. Ho, "Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools" [PDF preprint. Exposition by the authors: AI on Trial: Legal Models Hallucinate in 1 out of 6 (or More) Benchmarking Queries]
Natalie Shapira, Mosh Levy, Seyed Hossein Alavi, Xuhui Zhou, Yejin Choi, Yoav Goldberg, Maarten Sap, Vered Shwartz, "Clever Hans or Neural Theory of Mind? Stress Testing Social Reasoning in Large Language Models", arxiv:2305.14763
Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot, Ross Anderson, "AI models collapse when trained on recursively generated data", Nature 631 (2024): 755--759 = "The Curse of Recursion: Training on Generated Data Makes Models Forget", arxiv:2305.17493 [Comments]
Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar, Valentin Hofmann, Ananya Harsh Jha, Sachin Kumar, Li Lucy, Xinxi Lyu, Nathan Lambert, Ian Magnusson, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew E. Peters, Abhilasha Ravichander, Kyle Richardson, Zejiang Shen, Emma Strubell, Nishant Subramani, Oyvind Tafjord, Pete Walsh, Luke Zettlemoyer, Noah A. Smith, Hannaneh Hajishirzi, Iz Beltagy, Dirk Groeneveld, Jesse Dodge, Kyle Lo, "Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research", arxiv:2402.00159 [Thanks to Brendan O. for the pointer.]
Frank F. Xu, Uri Alon, Graham Neubig, "Why do Nearest Neighbor Language Models Work?", arxiv:2301.02828
Yotam Wolf, Noam Wies, Yoav Levine, Amnon Shashua, "Fundamental Limitations of Alignment in Large Language Models", arxiv:2304.11082 [Comments]
Jiahao Yu, Yuhang Wu, Dong Shu, Mingyu Jin, Xinyu Xing, "Assessing Prompt Injection Risks in 200+ Custom GPTs", arxiv:2311.11538
Oussama Zekri, Ambroise Odonnat, Abdelhakim Benechehab, Linus Bleistein, Nicolas Boullé, Ievgen Redko, "Large Language Models as Markov Chains", arxiv:2410.02724 [Comments]
Chunting Zhou, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, Susan Zhang, Gargi Ghosh, Mike Lewis, Luke Zettlemoyer, Omer Levy, "LIMA: Less Is More for Alignment", arxiv:2305.11206
Andy Zou, Zifan Wang, J. Zico Kolter, Matt Fredrikson, "Universal and Transferable Adversarial Attacks on Aligned Language Models", arxiv:2307.15043 [Demos, etc.]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, "Attention Is All You Need", arxiv:1706.03762 [The original "Transformer" paper]

Eleftheria Briakou, Colin Cherry, George Foster, "Searching for Needles in a Haystack: On the Role of Incidental Bilingualism in PaLM's Translation Capability", arxiv:2305.10266 [Department of "it's all in the training data"]
Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, Alina Oprea, Colin Raffel, "Extracting Training Data from Large Language Models", arxiv:2012.07805 [Comments]
Kent K. Chang, Mackenzie Cramer, Sandeep Soni, David Bamman, "Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4", arxiv:2305.00118
Lyra D'Souza and David Mimno, "The Chatbot and the Canon: Poetry Memorization in LLMs", CHR 2023: Computational Humanities Research Conference

R. Thomas McCoy, Shunyu Yao, Dan Friedman, Matthew Hardy, Thomas L. Griffiths, "Embers of Autoregression: Understanding Large Language Models Through the Problem They are Trained to Solve", arxiv:2309.13638
Zhaofeng Wu, Linlu Qiu, Alexis Ross, Ekin Akyürek, Boyuan Chen, Bailin Wang, Najoung Kim, Jacob Andreas, Yoon Kim, "Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks", arxiv:2307.02477
Kai Yan, Yufei Xu, Zhengyin Du, Xuesong Yao, Zheyu Wang, Xiaowen Guo, Jiecao Chen, "Recitation over Reasoning: How Cutting-Edge Language Models Can Fail on Elementary School-Level Reasoning Problems?", arxiv:2504.00509

Mor Geva, Roei Schuster, Jonathan Berant, Omer Levy, "Transformer Feed-Forward Layers Are Key-Value Memories", arxiv:2012.14913 [Comments]
Hubert Ramsauer, Bernhard Schäfl, Johannes Lehner, Philipp Seidl, Michael Widrich, Thomas Adler, Lukas Gruber, Markus Holzleitner, Milena Pavlović, Geir Kjetil Sandve, Victor Greiff, David Kreil, Michael Kopp, Günter Klambauer, Johannes Brandstetter, Sepp Hochreiter, "Hopfield Networks is All You Need", arxiv:2008.02217
Yao-Hung Hubert Tsai, Shaojie Bai, Makoto Yamada, Louis-Philippe Morency, Ruslan Salakhutdinov, "Transformer Dissection: A Unified Understanding of Transformer's Attention via the Lens of Kernel", pp. 4344--4353 in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing [EMNLP 2019], arxiv:1908.11775 [I should have read this when it came out; maybe I did, and forgot about it, but I doubt it.]

Nouha Dziri, Ximing Lu, Melanie Sclar, Xiang Lorraine Li, Liwei Jiang, Bill Yuchen Lin, Peter West, Chandra Bhagavatula, Ronan Le Bras, Jena D. Hwang, Soumya Sanyal, Sean Welleck, Xiang Ren, Allyson Ettinger, Zaid Harchaoui, Yejin Choi, "Faith and Fate: Limits of Transformers on Compositionality", arxiv:2305.18654
Kavi Gupta, Kate Sanders, Armando Solar-Lezama, "Randomly Sampled Language Reasoning Problems Reveal Limits of LLMs", arxiv:2501.02825
Bingbin Liu, Jordan T. Ash, Surbhi Goel, Akshay Krishnamurthy, Cyril Zhang, "Transformers Learn Shortcuts to Automata", arxiv:2210.10749
Hui Shi, Sican Gao, Yuandong Tian, Xinyun Chen and Jishen Zhao, "Learning Bounded Context-Free-Grammar via LSTM and the Transformer: Difference and the Explanations", Proceedings of the 36th AAAI Conference on Artificial Intelligence (2022): 8267--8276
Shizhuo Dylan Zhang, Curt Tigges, Stella Biderman, Maxim Raginsky, Talia Ringer
- "Can Transformers Learn to Solve Problems Recursively?", arxiv:2305.14699
- "Transformer-Based Models Are Not Yet Perfect At Learning to Emulate Structural Recursion", arxiv:2401.12947

Kenneth Li, "Do Large Language Models learn world models or just surface statistics?" The Gradient 21 January 2023 [Comments]
Kenneth Li, Aspen K. Hopkins, David Bau, Fernanda Viégas, Hanspeter Pfister, Martin Wattenberg, "Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task", arxiv:2210.13382
Keyon Vafa, Justin Y. Chen, Jon Kleinberg, Sendhil Mullainathan, Ashesh Rambachan, "Evaluating the World Model Implicit in a Generative Model", arxiv:2406.03689 [Comments above]

Albert Gu, Tri Dao, "Mamba: Linear-Time Sequence Modeling with Selective State Spaces", arxiv:2312.00752 [An alternative to the basic attention component of the transformer, based on what I'd call a "chain with complete connections" or "observation-driven model". The motivation is to get some actual selectivity (unlike "attention"), but the effect is to have a persistent state whose dynamics change with the input, which I think is a much more promising route to getting something which can learn structure. Disclaimer: First author is my colleague in the Machine Learning Department here at CMU.]
Jiacheng Liu, Sewon Min, Luke Zettlemoyer, Yejin Choi, Hannaneh Hajishirzi, "Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens", arxiv:2401.17377

Zellig Harris, Language and Information
Albert B. Lord, The Singer of Tales
John Livingston Lowes, The Road to Xanadu: A Study in the Ways of the Imagination

Henry Farrell and CRS, "Artificial Intelligence is a Familiar-Looking Monster", The Economist 21 June 2023 [Commentary]
Henry Farrell, Alison Gopnik, CRS and James Evans, "Large AI models are cultural and social technologies", Science 387 (2025): 1153--1156 [Free access]
CRS, "On Feral Library Card Catalogs, or, Aware of All Internet Traditions" [Mostly amplifying the Science paper, but also some ideas which came too late to include in it, such as a gesture about Zellig Harris]

Aman Bhargava, Cameron Witkowski, Shi-Zhuo Looi, Matt Thomson, "What's the Magic Word? A Control Theory of LLM Prompting", arxiv:2310.04444
Satwik Bhattamishra, Kabir Ahuja, Navin Goyal, "On the Ability and Limitations of Transformers to Recognize Formal Languages", arxiv:2009.11264
Hengyu Fu, Tianyu Guo, Yu Bai, Song Mei, "What can a Single Attention Layer Learn? A Study Through the Random Features Lens", arxiv:2307.11353
Borjan Geshkovski, Cyril Letrouit, Yury Polyanskiy, Philippe Rigollet
- "The emergence of clusters in self-attention dynamics", arxiv:2305.05465 [Interacting particle systems?!?]
- "A mathematical perspective on Transformers", arxiv:2312.10794
Angeliki Giannou, Shashank Rajput, Jy-yong Sohn, Kangwook Lee, Jason D. Lee, Dimitris Papailiopoulos, "Looped Transformers as Programmable Computers", arxiv:2301.13196
Gautam Goel, Peter Bartlett, "Can a Transformer Represent a Kalman Filter?", arxiv:2312.06937
Andrey Gromov, Kushal Tirumala, Hassan Shapourian, Paolo Glorioso, Daniel A. Roberts, "The Unreasonable Ineffectiveness of the Deeper Layers", arxiv:2403.17887
Michael Hahn, "Theoretical Limitations of Self-Attention in Neural Sequence Models", Transactions of the Association for Computational Linguistics 8 (2020): 156--171
M. Emrullah Ildiz, Yixiao Huang, Yingcong Li, Ankit Singh Rawat, Samet Oymak, "From Self-Attention to Markov Models: Unveiling the Dynamics of Generative Transformers", arxiv:2402.13512
Yingcong Li, Yixiao Huang, M. Emrullah Ildiz, Ankit Singh Rawat, Samet Oymak, "Mechanics of Next Token Prediction with Self-Attention", arxiv:2403.08081
Eran Malach, "Auto-Regressive Next-Token Predictors are Universal Learners", arxiv:2309.06979
William Merrill, Ashish Sabharwal, "Transformers Can Be Expressed In First-Order Logic with Majority", arxiv:2210.02671
Swaroop Nath, Harshad Khadilkar, Pushpak Bhattacharyya, "Transformers are Expressive, But Are They Expressive Enough for Regression?", arxiv:2402.15478
Binghui Peng, Srini Narayanan, Christos Papadimitriou, "On Limitations of the Transformer Architecture", arxiv:2402.08164
Riccardo Rende, Federica Gerace, Alessandro Laio, Sebastian Goldt, "Mapping of attention mechanisms to a generalized Potts model", arxiv:2304.07235
Dale Schuurmans, "Memory Augmented Large Language Models are Computationally Universal", arxiv:2301.04589 [This doesn't surprise me but it's good to have this confirmed]
Lena Strobl, William Merrill, Gail Weiss, David Chiang, Dana Angluin, "What Formal Languages Can Transformers Express? A Survey", arxiv:2311.00208
Gilad Yehudai, Haim Kaplan, Asma Ghandeharioun, Mor Geva, Amir Globerson, "When Can Transformers Count to n?", arxiv:2407.15160
Gail Weiss, Yoav Goldberg, Eran Yahav, "Thinking Like Transformers", arxiv:2106.06981
Hattie Zhou, Arwen Bradley, Etai Littwin, Noam Razin, Omid Saremi, Josh Susskind, Samy Bengio, Preetum Nakkiran, "What Algorithms can Transformers Learn? A Study in Length Generalization", arxiv:2310.16028

Gavin Abercrombie, Amanda Cercas Curry, Tanvi Dinkar, Zeerak Talat, "Mirages: On Anthropomorphism in Dialogue Systems", arxiv:2305.09800
Norah Alzahrani, Hisham Abdullah Alyahya, Yazeed Alnumay, Sultan Alrashed, Shaykhah Alsubaie, Yusef Almushaykeh, Faisal Mirza, Nouf Alotaibi, Nora Altwairesh, Areeb Alowisheq, M Saiful Bari, Haidar Khan, "When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards", arxiv:2402.01781
Marcel Binz and Eric Schulz, "Using cognitive psychology to understand GPT-3", Proceedings of the National Academy of Sciences 120 (2023): e2218523120
Maxime Griot, Jean Vanderdonckt, Demet Yuksel, Coralie Hemptinne, "Multiple Choice Questions and Large Languages Models: A Case Study with Fictional Medical Data", arxiv:2406.02394
Jennifer Hu, Felix Sosa, Tomer Ullman, "Re-evaluating Theory of Mind evaluation in large language models", arxiv:2502.21098
Anna A. Ivanova, "Running cognitive evaluations on large language models: The do's and the don'ts", arxiv:2312.01276
Maurice Jakesch, Jeffrey Hancock, Mor Naaman, "Human heuristics for AI-generated language are flawed", Proceedings of the National Academy of Sciences 120 (2023): e2208839120, arxiv:2206.07271
Cameron Jones, Benjamin Bergen, "Does GPT-4 Pass the Turing Test?", arxiv:2310.20216 [Via Melanie Mitchell. The mindblowing bit in the abstract is Eliza's pass-rate!]
Kyle Mahowald, Anna A. Ivanova, Idan A. Blank, Nancy Kanwisher, Joshua B. Tenenbaum, Evelina Fedorenko, "Dissociating language and thought in large language models: a cognitive perspective", arxiv:2301.06627
William Merrill, Yoav Goldberg, Roy Schwartz, Noah A. Smith, "Provable Limitations of Acquiring Meaning from Ungrounded Form: What Will Future Language Models Understand?", arxiv:2104.10809
Melanie Mitchell, Alessandro B. Palmarini, Arseny Moskvichev, "Comparing Humans, GPT-4, and GPT-4V On Abstraction and Reasoning Tasks", arxiv:2311.09247
Marianna Nezhurina, Lucia Cipolina-Kun, Mehdi Cherti, Jenia Jitsev, "Alice in Wonderland: Simple Tasks Showing Complete Reasoning Breakdown in State-Of-the-Art Large Language Models", arxiv:2406.02061
Steven Piantadosi, "Modern language models refute Chomsky's approach to language", lingbuzz/007180 (2023) [From the abstract, this seems remarkably mis-guided to me: the sheer volume of data needed for LLMs, compared to what children are exposed to, seems on the contrary a striking vindication of the core Chomskian insights, properly understood. But this might just be an instance of me refusing to re-think conclusions I reached decades ago. (Thanks to Brendan O'Connor for the pointer.)]
Yasaman Razeghi, Robert L. Logan IV, Matt Gardner, Sameer Singh, "Impact of Pretraining Term Frequencies on Few-Shot Reasoning", arxiv:2202.07206
Richard Shiffrin and Melanie Mitchell, "Probing the psychology of AI models", Proceedings of the National Academy of Sciences 120 (2023): e2300963120
Kaya Stechly, Matthew Marquez, Subbarao Kambhampati, "GPT-4 Doesn't Know It's Wrong: An Analysis of Iterative Prompting for Reasoning Problems", arxiv:2310.12397
Karthik Valmeekam, Matthew Marquez, Subbarao Kambhampati, "Can Large Language Models Really Improve by Self-critiquing Their Own Plans?", arxiv:2310.08118
Manuel Vargas Guzmán, Jakub Szymanik, Maciej Malicki, "Testing the limits of logical reasoning in neural and hybrid models", pp. 2267--2279 in Duh, Gomez and Bethard (eds.), Findings of the Association for Computational Linguistics: NAACL 2024
Will Yeadon, Alex Peach, Craig P. Testrow, "A comparison of Human, GPT-3.5, and GPT-4 Performance in a University-Level Coding Course", arxiv:2403.16977
Edwin Zhang, Vincent Zhu, Naomi Saphra, Anat Kleiman, Benjamin L. Edelman, Milind Tambe, Sham M. Kakade, Eran Malach, "Transcendence: Generative Models Can Outperform The Experts That Train Them", arxiv:2406.11741

Ekin Akyürek, Dale Schuurmans, Jacob Andreas, Tengyu Ma, Denny Zhou, "What learning algorithm is in-context learning? Investigations with linear models", arxiv:2211.15661
Yu Bai, Fan Chen, Huan Wang, Caiming Xiong, Song Mei, "Transformers as Statisticians: Provable In-Context Learning with In-Context Algorithm Selection", arxiv:2306.04637
Stephanie C. Y. Chan, Adam Santoro, Andrew K. Lampinen, Jane X. Wang, Aaditya Singh, Pierre H. Richemond, Jay McClelland, Felix Hill, "Data Distributional Properties Drive Emergent In-Context Learning in Transformers", arxiv:2205.05055
Tian Jin, Nolan Clement, Xin Dong, Vaishnavh Nagarajan, Michael Carbin, Jonathan Ragan-Kelley, Gintare Karolina Dziugaite, "The Cost of Down-Scaling Language Models: Fact Recall Deteriorates before In-Context Learning", arxiv:2310.04680
Licong Lin, Yu Bai, Song Mei, "Transformers as Decision Makers: Provable In-Context Reinforcement Learning via Supervised Pretraining", arxiv:2310.08566
Sheng Lu, Irina Bigoulaeva, Rachneet Sachdeva, Harish Tayyar Madabushi, Iryna Gurevych, "Are Emergent Abilities in Large Language Models just In-Context Learning?", arxiv:2309.01809
Xinyi Wang, Wanrong Zhu, Michael Saxon, Mark Steyvers, William Yang Wang, "Large Language Models Are Latent Variable Models: Explaining and Finding Good Demonstrations for In-Context Learning", arxiv:2301.11916 [Thanks to GMG for the pointer]
Steve Yadlowsky, Lyric Doshi, Nilesh Tripuraneni
- "Can Transformer Models Generalize Via In-Context Learning Beyond Pretraining Data?", NeurIPS 2023 DistShift workshop
- "Pretraining Data Mixtures Enable Narrow Model Selection Capabilities in Transformer Models", arxiv:2311.00871
Ruiqi Zhang, Spencer Frei, Peter L. Bartlett, "Trained Transformers Learn Linear Models In-Context", arxiv:2306.09927
Siyan Zhao, Tung Nguyen, Aditya Grover, "Probing the Decision Boundaries of In-context Learning in Large Language Models", arxiv:2406.11233

"OthelloGPT learned a bag of heuristics" [2 July 2024]
Gregor Bachmann, Vaishnavh Nagarajan, "The pitfalls of next-token prediction", arxiv:2403.06963
Corneel Casert, Isaac Tamblyn, Stephen Whitelam, "Learning stochastic dynamics and predicting emergent behavior using transformers", arxiv:2202.08708
Charles Jin, Martin Rinard, "Emergent Representations of Program Semantics in Language Models Trained on Programs", arxiv:2305.11169
Eshaan Nichani, Alex Damian, Jason D. Lee, "How Transformers Learn Causal Structure with Gradient Descent", arxiv:2402.14735
Adam S. Shai, Sarah E. Marzen, Lucas Teixeira, Alexander Gietelink Oldenziel, Paul M. Riechers, "Transformers represent belief state geometry in their residual stream", arxiv:2405.15943
Yi Zhang, Arturs Backurs, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Tal Wagner, "Unveiling Transformers with LEGO: a synthetic reasoning task", arxiv:2206.04301

Aydar Bulatov, Yuri Kuratov, Mikhail S. Burtsev, "Scaling Transformer to 1M tokens and beyond with RMT", arxiv:2304.11062
Soham De, Samuel L. Smith, Anushan Fernando, Aleksandar Botev, George Cristian-Muraru, Albert Gu, Ruba Haroun, Leonard Berrada, Yutian Chen, Srivatsan Srinivasan, Guillaume Desjardins, Arnaud Doucet, David Budden, Yee Whye Teh, Razvan Pascanu, Nando De Freitas, Caglar Gulcehre, "Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models", arxiv:2402.19427
Aniket Didolkar, Kshitij Gupta, Anirudh Goyal, Nitesh B. Gundavarapu, Alex Lamb, Nan Rosemary Ke, Yoshua Bengio, "Temporal Latent Bottleneck: Synthesis of Fast and Slow Processing Mechanisms in Sequence Learning", arxiv:2205.14794
Albert Gu, Karan Goel, Christopher Re, "Efficiently Modeling Long Sequences with Structured State Spaces", ICLR 2022, arxiv:2111.00396 [Apparently now mostly superseded by the Gu and Dao (2023) paper, but I should still go back to read this...]
Samy Jelassi, David Brandfonbrener, Sham M. Kakade, Eran Malach, "Repeat After Me: Transformers are Better than State Space Models at Copying", arxiv:2402.01032
William Merrill, Jackson Petty, Ashish Sabharwal, "The Illusion of State in State-Space Models", arxiv:2404.08819
Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Huanqi Cao, Xin Cheng, Michael Chung, Matteo Grella, Kranthi Kiran GV, Xuzheng He, Haowen Hou, Przemyslaw Kazienko, Jan Kocon, Jiaming Kong, Bartlomiej Koptyra, Hayden Lau, Krishna Sri Ipsit Mantri, Ferdinand Mom, Atsushi Saito, Xiangru Tang, Bolun Wang, Johan S. Wind, Stanislaw Wozniak, Ruichong Zhang, Zhenyuan Zhang, Qihang Zhao, Peng Zhou, Jian Zhu, Rui-Jie Zhu, "RWKV: Reinventing RNNs for the Transformer Era", arxiv:2305.13048
Wangchunshu Zhou, Yuchen Eleanor Jiang, Peng Cui, Tiannan Wang, Zhenxin Xiao, Yifan Hou, Ryan Cotterell, Mrinmaya Sachan, "RecurrentGPT: Interactive Generation of (Arbitrarily) Long Text", arxiv:2305.13304

Levin Brinkmann, Fabian Baumann, Jean-François Bonnefon, Maxime Derex, Thomas F. Müller, Anne-Marie Nussberger, Agnieszka Czaplicka, Alberto Acerbi, Thomas L. Griffiths, Joseph Henrich, Joel Z. Leibo, Richard McElreath, Pierre-Yves Oudeyer, Jonathan Stray and Iyad Rahwan, "Machine culture", Nature Human Behaviour 7 (2023): 1855--1868
Nicholas Buttrick, "Studying large language models as compression algorithms for human culture", Trends in Cognitive Sciences 28 (2024): 187--189
Harvey Lederman, Kyle Mahowald, "Are Language Models More Like Libraries or Like Librarians? Bibliotechnism, the Novel Reference Problem, and the Attitudes of LLMs", arxiv:2401.04854 [My initial reaction is to find this very unconvincing, but I should go and look at it carefully...]
Matteo Pasquinelli and Vladan Joler, "The Nooscope manifested: AI as instrument of knowledge extractivism", AI and Society 36 (2021): 1263--1280

uncertainty quantification

conformal prediction

John J. Cherian, Isaac Gibbs, Emmanuel J. Candès, "Large language model validity via enhanced conformal prediction methods", NeurIPS 2024, arxiv:2406.09714
Jessica Hullman, Yifan Wu, Dawei Xie, Ziyang Guo, Andrew Gelman, "Conformal Prediction and Human Decision Making", arxiv:2503.11709
Zhuohang Li, Chao Yan, Nicholas J. Jackson, Wendi Cui, Bo Li, Jiaxin Zhang, Bradley A. Malin, "Towards Statistical Factuality Guarantee for Large Vision-Language Models", arxiv:2502.20560
Xiaoou Liu, Tiejin Chen, Longchao Da, Chacha Chen, Zhen Lin, Hua Wei, "Uncertainty Quantification and Confidence Calibration in Large Language Models: A Survey", arxiv:2503.15850
Christopher Mohri, Tatsunori Hashimoto, "Language Models with Conformal Factuality Guarantees", arxiv:2402.10978
Victor Quach, Adam Fisch, Tal Schuster, Adam Yala, Jae Ho Sohn, Tommi S. Jaakkola, Regina Barzilay, "Conformal Language Modeling", arxiv:2306.10193

Sina Alemohammad, Josue Casco-Rodriguez, Lorenzo Luzi, Ahmed Imtiaz Humayun, Hossein Babaei, Daniel LeJeune, Ali Siahkoohi, Richard G. Baraniuk, "Self-Consuming Generative Models Go MAD", arxiv:2307.01850
Zeyuan Allen-Zhu, Yuanzhi Li, "Physics of Language Models: Part 3.1, Knowledge Storage and Extraction", arxiv:2309.14316
Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio, "Neural Machine Translation by Jointly Learning to Align and Translate", arxiv:1409.0473 [Supposedly the first paper on "attention"]
Nora Belrose, Zach Furman, Logan Smith, Danny Halawi, Igor Ostrovsky, Lev McKinney, Stella Biderman, Jacob Steinhardt, "Eliciting Latent Predictions from Transformers with the Tuned Lens", arxiv:2303.08112
Andres M Bran, Sam Cox, Andrew D White, Philippe Schwaller, "ChemCrow: Augmenting large-language models with chemistry tools", arxiv:2304.05376
Collin Burns, Haotian Ye, Dan Klein, Jacob Steinhardt, "Discovering Latent Knowledge in Language Models Without Supervision", arxiv:2212.03827 [Weird-sounding]
Angelica Chen, Sadhika Malladi, Lily H. Zhang, Xinyi Chen, Qiuyi Zhang, Rajesh Ranganath, Kyunghyun Cho, "Preference Learning Algorithms Do Not Learn Preference Rankings", arxiv:2405.19534
Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, Quanquan Gu, "Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models", arxiv:2401.01335
Verna Dankers, Ivan Titov, Dieuwke Hupkes, "Memorisation Cartography: Mapping out the Memorisation-Generalisation Continuum in Neural Machine Translation", arxiv:2311.05379
Grégoire Delétang, Anian Ruoss, Paul-Ambroise Duquenne, Elliot Catt, Tim Genewein, Christopher Mattern, Jordi Grau-Moya, Li Kevin Wenliang, Matthew Aitchison, Laurent Orseau, Marcus Hutter, Joel Veness, "Language Modeling Is Compression", arxiv:2309.10668 [From a quick skim, this is not quite someone trying out my longed-for GLLZ, but closer than anything else I've seen.]
Vittoria Dentella, Elliot Murphy, Gary Marcus, Evelina Leivada, "Testing AI performance on less frequent aspects of language reveals insensitivity to underlying meaning", arxiv:2302.12313
Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, Igor Mordatch, "Improving Factuality and Reasoning in Language Models through Multiagent Debate", arxiv:2305.14325
Alex Duchnowski, Ellie Pavlick, Alexander Koller, "EHOP: A Dataset of Everyday NP-Hard Optimization Problems", arxiv:2502.13776
Brandon Duderstadt, Hayden S. Helm, Carey E. Priebe, "Comparing Foundation Models using Data Kernels", arxiv:2305.05126
Tolga Ergen, Behnam Neyshabur, Harsh Mehta, "Convexifying Transformers: Improving optimization and understanding of transformer networks", arxiv:2211.11052
Hao Fang, Anusha Balakrishnan, Harsh Jhamtani, John Bufe, Jean Crawford, Jayant Krishnamurthy, Adam Pauls, Jason Eisner, Jacob Andreas, Dan Klein, "The Whole Truth and Nothing But the Truth: Faithful and Controllable Dialogue Response Generation with Dataflow Transduction and Constrained Decoding", arxiv:2209.07800
Matthew Finlayson, Xiang Ren, Swabha Swayamdipta, "Logits of API-Protected LLMs Leak Proprietary Information", arxiv:2403.09539
Roger Grosse, Juhan Bae, Cem Anil, Nelson Elhage, Alex Tamkin, Amirhossein Tajdini, Benoit Steiner, Dustin Li, Esin Durmus, Ethan Perez, Evan Hubinger, Kamile Lukosiute, Karina Nguyen, Nicholas Joseph, Sam McCandlish, Jared Kaplan, Samuel R. Bowman, "Studying Large Language Model Generalization with Influence Functions", arxiv:2308.03296
Michael Hassid, Hao Peng, Daniel Rotem, Jungo Kasai, Ivan Montero, Noah A. Smith, Roy Schwartz, "How Much Does Attention Actually Attend? Questioning the Importance of Attention in Pretrained Transformers", arxiv:2211.03495
Robert Irvine, Douglas Boubert, Vyas Raina, Adian Liusie, Ziyi Zhu, Vineet Mudupalli, Aliaksei Korshuk, Zongyi Liu, Fritz Cremer, Valentin Assassi, Christie-Carol Beauchamp, Xiaoding Lu, Thomas Rialan, William Beauchamp, "Rewarding Chatbots for Real-World Engagement with Millions of Users", arxiv:2303.06135
Yibo Jiang, Goutham Rajendran, Pradeep Ravikumar, Bryon Aragam, Victor Veitch, "On the Origins of Linear Representations in Large Language Models", arxiv:2403.03867
Di Jin, Zhijing Jin, Joey Tianyi Zhou, Peter Szolovits, "Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment", arxiv:1907.11932
Jean Kaddour, Oscar Key, Piotr Nawrot, Pasquale Minervini, Matt J. Kusner, "No Train No Gain: Revisiting Efficient Training Algorithms For Transformer-based Language Models", arxiv:2307.06440
Adam Tauman Kalai, Santosh S. Vempala, "Calibrated Language Models Must Hallucinate", arxiv:2311.14648
Chandra Shekhara Kaushik Valmeekam, Krishna Narayanan, Dileep Kalathil, Jean-Francois Chamberland, Srinivas Shakkottai, "LLMZip: Lossless Text Compression using Large Language Models", arxiv:2306.04050
Jon Kleinberg, Sendhil Mullainathan, "Language Generation in the Limit", arxiv:2404.06757
Frederik Kunstner, Robin Yadav, Alan Milligan, Mark Schmidt, Alberto Bietti, "Heavy-Tailed Class Imbalance and Why Adam Outperforms Gradient Descent on Language Models", arxiv:2402.19449
Ben Krause, Emmanuel Kahembwe, Iain Murray, Steve Renals, "Dynamic Evaluation of Transformer Language Models", arxiv:1904.08378
Brian Lester, Rami Al-Rfou and Noah Constant, "The Power of Scale for Parameter-Efficient Prompt Tuning", EMNLP 2021
Victoria Lin, Eli Ben-Michael, Louis-Philippe Morency, "Optimizing Language Models for Human Preferences is a Causal Inference Problem", arxiv:2402.14979
Zhen Lin, Shubhendu Trivedi, Jimeng Sun, "Generating with Confidence: Uncertainty Quantification for Black-box Large Language Models", arxiv:2305.19187
Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, Percy Liang, "Lost in the Middle: How Language Models Use Long Contexts", arxiv:2307.03172
Eran Malach, "Auto-Regressive Next-Token Predictors are Universal Learners", arxiv:2309.06979
Matteo Marchi, Stefano Soatto, Pratik Chaudhari, Paulo Tabuada, "Heat Death of Generative Models in Closed-Loop Learning", arxiv:2404.02325
Clara Meister, Tiago Pimentel, Gian Wiher, Ryan Cotterell, "Locally Typical Sampling", arxiv:2202.00666
Evan Miller, "Attention Is Off by One", 24 July 2023 [This would shift "attention" from being a weighted average, to being a weighted average shrunk towards zero.]
Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, Mehrdad Farajtabar, "GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models", arxiv:2410.05229
Milad Nasr, Nicholas Carlini, Jonathan Hayase, Matthew Jagielski, A. Feder Cooper, Daphne Ippolito, Christopher A. Choquette-Choo, Eric Wallace, Florian Tramèr, Katherine Lee, "Scalable Extraction of Training Data from (Production) Language Models", arxiv:2311.17035
Vassilis Papadopoulos, Jérémie Wenger, Clément Hongler, "Arrows of Time for Large Language Models", arxiv:2401.17505
Kiho Park, Yo Joong Choe, Yibo Jiang, Victor Veitch, "The Geometry of Categorical and Hierarchical Concepts in Large Language Models", arxiv:2406.01506
Kiho Park, Yo Joong Choe, Victor Veitch, "The Linear Representation Hypothesis and the Geometry of Large Language Models", arxiv:2311.03658
Jonathan Pilault, Can Liu, Mohit Bansal, Markus Dreyer, "On Conditional and Compositional Language Model Differentiable Prompting", arxiv:2307.01446 [This seems like a potentially interesting way of grafting some Good Old-Fashioned AI (production systems!) on to LLMs...]
Pooyan Rahmanzadehgervi, Logan Bolton, Mohammad Reza Taesiri, Anh Totti Nguyen, "Vision language models are blind", arxiv:2407.06581
Sebastian Raschka, "Understanding and Coding the Self-Attention Mechanism of Large Language Models From Scratch" [Tutorial, 9 February 2023]
Riccardo Rende, Federica Gerace, Alessandro Laio, Sebastian Goldt, "What does self-attention learn from Masked Language Modelling?", arxiv:2304.07235
Philip Resnik, "Large Language Models are Biased Because They Are Large Language Models", arxiv:2406.13138 [From an initial skim, I find this highly convincing, but it therefore needs a closer and more critical reading]
Pau Rodriguez, Arno Blaas, Michal Klein, Luca Zappella, Nicholas Apostoloff, Marco Cuturi, Xavier Suau, "Controlling Language and Diffusion Models by Transporting Activations", arxiv:2410.23054
Anna Rogers, Olga Kovaleva, Anna Rumshisky, "A Primer in BERTology: What we know about how BERT works", arxiv:2002.12327 [Skimmed, re-read]
Laura Ruis, Maximilian Mozes, Juhan Bae, Siddhartha Rao Kamalakara, Dwarak Talupuru, Acyr Locatelli, Robert Kirk, Tim Rocktäschel, Edward Grefenstette, Max Bartolo, "Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models", arxiv:2411.12580
Arda Sahiner, Tolga Ergen, Batu Ozturkler, John Pauly, Morteza Mardani, Mert Pilanci, "Unraveling Attention via Convex Duality: Analysis and Interpretations of Vision Transformers", arxiv:2205.08078
Michael E. Sander, Pierre Ablin, Mathieu Blondel, Gabriel Peyré, "Sinkformers: Transformers with Doubly Stochastic Attention", arxiv:2110.11773
Shibani Santurkar, Esin Durmus, Faisal Ladhak, Cinoo Lee, Percy Liang, Tatsunori Hashimoto, "Whose Opinions Do Language Models Reflect?", arxiv:2303.17548
Rylan Schaeffer, Brando Miranda, Sanmi Koyejo, "Are Emergent Abilities of Large Language Models a Mirage?", arxiv:2304.15004
Rylan Schaeffer, Kateryna Pistunova, Samar Khanna, Sarthak Consul, Sanmi Koyejo, "Invalid Logic, Equivalent Gains: The Bizarreness of Reasoning in Language Model Prompting", arxiv:2307.10573
Sofia Serrano, Zander Brumbaugh, Noah A. Smith, "Language Models: A Guide for the Perplexed", arxiv:2311.17301
Peiqi Sui, Eamon Duede, Sophie Wu, Richard Jean So, "Confabulation: The Surprising Value of Large Language Model Hallucinations", arxiv:2406.04175
Dennis Yi Tenen, Literary Theory for Robots: How Computers Learned to Write
Karthik Valmeekam, Matthew Marquez, Sarath Sreedharan, Subbarao Kambhampati, "On the Planning Abilities of Large Language Models -- A Critical Investigation", arxiv:2305.15771
Veniamin Veselovsky, Manoel Horta Ribeiro, Robert West, "Artificial Artificial Artificial Intelligence: Crowd Workers Widely Use Large Language Models for Text Production Tasks", arxiv:2306.07899
Leif Weatherby, Language Machines: Cultural AI and the End of Remainder Humanism [Review by Henry Farrell]
Zachary Wojtowicz, Simon DeDeo, "Undermining Mental Proof: How AI Can Make Cooperation Harder by Making Thinking Easier", arxiv:2407.14452
Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, Karthik Narasimhan, "Tree of Thoughts: Deliberate Problem Solving with Large Language Models", arxiv:2305.10601
Mengxia Yu, De Wang, Qi Shan, Colorado Reed, Alvin Wan, "The Super Weight in Large Language Models", arxiv:2411.07191
Yaodong Yu, Sam Buchanan, Druv Pai, Tianzhe Chu, Ziyang Wu, Shengbang Tong, Hao Bai, Yuexiang Zhai, Benjamin D. Haeffele, Yi Ma, "White-Box Transformers via Sparse Rate Reduction: Compression Is All There Is?", arxiv:2311.13110
Eric Zelikman, Qian Huang, Gabriel Poesia, Noah D. Goodman, Nick Haber, "Parsel: Algorithmic Reasoning with Language Models by Composing Decompositions", arxiv:2212.10561
Ailing Zeng, Muxi Chen, Lei Zhang, Qiang Xu, "Are Transformers Effective for Time Series Forecasting?", arxiv:2205.13504
Hanlin Zhang, Benjamin L. Edelman, Danilo Francati, Daniele Venturi, Giuseppe Ateniese, Boaz Barak, "Watermarks in the Sand: Impossibility of Strong Watermarking for Generative Models", arxiv:2311.04378
Honghua Zhang, Meihua Dang, Nanyun Peng, Guy Van den Broeck, "Tractable Control for Autoregressive Language Generation", arxiv:2304.07438

~~if I really want to embarrass myself~~

while teaching it

"Large Language Models in Statistical Perspective"

Major previous versions: 6 May 2023; 1 June 2023; 4 June 2023; 17 October 2023