Notes on "Intriguing Properties of Neural Networks", and two other papers (2014)
\[ \DeclareMathOperator*{\argmax}{argmax} \]
Attention conservation notice: Slides full of bullet points are never good reading; why would you force yourself to read painfully obsolete slides (including even more painfully dated jokes) about a rapidly moving subject?
These are basically the slides I presented at CMU's Statistical Machine Learning Reading Group on 13 November 2014, on the first paper on what have come to be called "adversarial examples". It includes some notes I made after the group meeting about the Q-and-A, though I may not have properly credited (or understood) everyone's contributions even at the time. It also includes some even rougher notes on two relevant papers that came out the following month. Presented now because I'm procrastinating on preparing my fall class, and in the interest of the historical record.
Background
- Nostalgia for the early 1990s: G. Hinton and company are poised to take over the world, NIPS is mad for neural networks, Clinton is running for President...
- Learning about neural networks for the first time in cog. sci. 1
- Apocrypha: a neural network supposed to distinguish tanks from trucks in aerial photographs actually learned about parking lots...
- The models
- Multilayer perceptron \[ \phi(x) = (\phi_K \circ \phi_{K-1} \circ \cdots \circ \phi_1)(x) \] (a minimal code sketch follows this list)
- This paper not concerned with training protocol, just following what others have done
- The applications
- MNIST digit-recognition
- ImageNet
- \(10^7\) images from YouTube
- So we've got autoencoders, we've got convolutional networks, we've got your favorite architecture and way of training it
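Since the paper treats the trained network as a black-box composition of layers, here is a minimal numpy sketch of the composed map \(\phi = \phi_K \circ \cdots \circ \phi_1\). The layer widths and the ReLU nonlinearity are illustrative choices of mine, not the architectures used in any of these papers.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_layer(n_in, n_out):
    """One phi_k: an affine map followed by an elementwise ReLU."""
    W = rng.standard_normal((n_out, n_in)) / np.sqrt(n_in)
    b = np.zeros(n_out)
    return lambda h: np.maximum(0.0, W @ h + b)

# phi = phi_3 o phi_2 o phi_1, with made-up layer widths
layers = [make_layer(784, 256), make_layer(256, 64), make_layer(64, 10)]

def phi(x):
    h = x
    for layer in layers:        # phi_1 is applied first, phi_K last
        h = layer(h)
    return h

x = rng.random(784)             # stand-in for a flattened 28x28 image
print(phi(x).shape)             # (10,)
```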
Where Are the Semantics?
- The claim under examination: individual hidden units carry semantic content, as judged by \(\mathcal{X}_i\), the set of held-out images maximizing the activation along a coordinate direction, \(\argmax_{x} \langle \phi(x), e_i \rangle\) and its runners-up
- The paper's critique: the set \(\mathcal{X}_v\) of images maximizing the activation along a random direction \(v\), \(\argmax_{x} \langle \phi(x), v \rangle\) and its runners-up, looks just as semantically coherent, so the coordinate axes of the feature space are not special
- My assessment of the critique:
- Weakness of both claim and critique: people are very good at finding semantic features that link random objects (e.g., Zhu, Rogers and Gibson, "Human Rademacher Complexity", NIPS 2009, where random word lists come up with semantics like "related to motel service")
- How much semantics would people be able to read into a random collection of training \(x\)'s of equal size to \(\mathcal{X}_i\) or \(\mathcal{X}_v\)?
- Implications of the critique: This is, if anything, a vindication of good old-fashioned parallel distributed processing (as in the 1990s...)
- Doesn't matter as engineering...
- Also doesn't matter as a caricature of animal nervous systems: just as there are no grandmother cells in the brain, there is no "white flower" cell in the network; the representation is distributed across the network
- Suggestions from the audience:
- Ryan Tibshirani: maybe the basis vectors are more "prototypical" than the random vectors? Referenced papers by Nina Balcan and by Robert Tibshirani on prototype clustering
- Yu-Xiang Wang: What about images corresponding to weights of hidden-layer neurons in convolutional networks? Me: I need to see what's going on in that paper before I can comment...
The Learned Classifier Isn't Perceptually Continuous
- Claim: generalization based on semantics, or at least on features not local in the input space
- "Adversarial examples"
- Find the smallest perturbation \(r\) we can apply to a given \(x\) to drive it to the desired class \(l\), i.e., the smallest \(r\) s.t. \(f(x+r) = l\) (a code sketch follows this list)
- Robustness: use the same perturbation on a different network (different hyperparameters, different training set) and see whether \(f^{\prime}(x+r) = l\) as well
- Some adversarial images
- "All images in the right column are predicted to be an"ostrich, Struthio camelus"
- "The examples are strictly randomly chosen. There is not any postselection involved."
- Some more adversarial images: car vs. not-car
- Some adversarial digits
- RMS distortion \(0.063\) on a 0-1 scale, accuracy 0
- Some non-adversarial digits
- Gaussian white noise, RMS distortion \(0.1\), accuracy 51% (!)
- \(\therefore\) it's not just the magnitude of the noise, it's something about the structure
- Assessing the adversarial examples
- Obviously all adversarial examples are mis-classified by the \(f\) they're built for
- But much higher mis-classification rates on other networks as well
- Much higher mis-classification rates than much bigger Gaussian noise
- To be fair, what would adversarial examples look like for any other classifier? For any other classifier using the kernel trick? Someone should try this for a support vector machine
- Suggestions from the audience:
- Ryan: stability somehow? Changing one data point (say, swapping one dog for the adversarial dog) must really change the function. Me: but isn't stability about the stability of the risk (a number in \(\mathbb{R}\)), not the stability of the classifier (a point in function space)? The risk averages over the measure again, so it's smoother than the function as such. Martin Azizyan: so even if you always got a dense subset wrong, it wouldn't affect your risk. Me: right, though good luck proving that there is a countable dense subset of adversarial dogs. Ryan: the senses of stability ought to be related somehow? Me, after: maybe, see London et al.
- David Choi: might this be some sort of generic phenomenon in very high dimensions? Since most of the volume of a closed body is very close to its surface, unless the geometric margin is very large, most points will be very close in the input space to crossing some decision boundary
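To make the adversarial-example construction concrete: Szegedy et al. find the perturbation with a box-constrained L-BFGS solver on the real networks; the numpy sketch below is my own much cruder stand-in, plain gradient descent on a penalized objective applied to a toy softmax classifier with random weights, but it exercises the same idea of making \(f(x+r) = l\) while keeping \(\|r\|\) small.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_classes = 784, 10

# Toy linear softmax classifier standing in for f; the random weights are only
# there so the code runs end to end, not a model of anything in the paper.
W = rng.standard_normal((n_classes, d)) * 0.01

def probs(x):
    z = W @ x
    z = z - z.max()                  # for numerical stability
    p = np.exp(z)
    return p / p.sum()

def f(x):
    return int(np.argmax(probs(x)))

x = rng.random(d)                    # stand-in for an image on a 0-1 scale
target = (f(x) + 1) % n_classes      # any class other than the current one

# Minimize cross-entropy toward the target class plus a penalty on ||r||^2,
# a crude surrogate for "the smallest r such that f(x + r) = target".
r = np.zeros(d)
c, step = 0.01, 1.0
for _ in range(500):
    p = probs(x + r)
    # gradient of -log p[target] w.r.t. the input, plus the penalty's gradient
    grad = W.T @ (p - np.eye(n_classes)[target]) + 2 * c * r
    r = r - step * grad

rms = np.sqrt(np.mean(r ** 2))
print("f(x):", f(x), " f(x+r):", f(x + r), " target:", target,
      " RMS distortion:", round(float(rms), 3))
```

The RMS-distortion line is there to echo the paper's 0.063-on-a-0-1-scale figure, though whatever number this toy prints means nothing beyond the toy.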
How Can This Be?
- The networks aren't over-fitting in any obvious sense
- CV shows they generalize to new instances from the same training pool
- The paper suggests looking at "instability", i.e., the Lipschitz constant of the \(\phi\) mapping
- They can only upper-bound this, by multiplying together the per-layer operator norms (a code sketch follows this list)
- And frankly it's not very persuasive as an answer (the Lipschitz constant is a global property, not a local one)
- Speculative thought 1: what is it like to be an autoencoder?
- Adversarial examples are perceptually indistinguishable for humans but not for the networks
- \(\therefore\) human perceptual feature space is very different from the network's feature space
- What do adversarial examples look like for humans? (Possible psych. experiment with Mechanical Turk)
- Speculative thought 2: "be careful what you wish for"
- Good generalization to the distribution generating instances
- This is \(n^{-1}\sum_{i=1}^{n}{\delta(x-x_i)}\), and \(n = 10^7\) is vanishingly small compared to what it would take to fill out the input space
- Big gaps around every point in the support
- \(\therefore\) training to generalize to this distribution doesn't care about small-scale continuity...
- IOW, the network is doing exactly what it's designed to do
- Do we need lots more than \(10^7\) images? How many does a baby see by the time it's a year old, anyway?
- What if we added absolutely-continuous noise to every image every time it was used? (Not enough, they use Gaussians)
- They do better when they feed adversarial examples into the training process (natch), but it's not clear whether that just shifts the adversarial examples around...
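On the "instability" point: each layer is Lipschitz, Lipschitz constants compose multiplicatively, and the paper's upper bound comes from the per-layer operator norms of the weight matrices. The numpy sketch below does that calculation for a made-up stack of ReLU layers of my own, and compares it with the sensitivity actually observed at one point in one direction; the gap between the two numbers is one way to see why a global bound says little about local behavior.

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up weight matrices for a small stack of ReLU layers (illustrative only).
weights = [rng.standard_normal((256, 784)) / np.sqrt(784),
           rng.standard_normal((64, 256)) / np.sqrt(256),
           rng.standard_normal((10, 64)) / np.sqrt(64)]

def phi(x):
    h = x
    for W in weights:
        h = np.maximum(0.0, W @ h)   # ReLU layers; ReLU is 1-Lipschitz
    return h

# Global upper bound on the Lipschitz constant of phi: the product of the
# layers' operator (largest-singular-value) norms.
upper_bound = np.prod([np.linalg.norm(W, 2) for W in weights])

# Sensitivity actually observed at one input, in one random direction.
x = rng.random(784)
u = rng.standard_normal(784)
u = u / np.linalg.norm(u)
eps = 1e-4
local = np.linalg.norm(phi(x + eps * u) - phi(x)) / eps

print("product-of-norms upper bound:", round(float(upper_bound), 2))
print("observed local sensitivity:  ", round(float(local), 2))
```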
- Szegedy et al. looked for minimally-small perturbations which changed the predicted class
- Nguyen et al. look for the largest possible perturbations which leave the class alone: "it is easy to produce images that are completely unrecognizable to humans, but that state-of-the-art DNNs believe to be recognizable objects with 99.99% confidence (e.g. labeling with certainty that white noise static is a lion)."
- Used evolutionary algorithms to find these examples, either "directly encoded" (genome = the pixels) or "indirectly encoded" (genome = a picture-producing network, with more structure in the image).
- Also used gradient ascent on the network's confidence in the target class (a code sketch follows the figures below)
"Evolved images that are unrecognizable to humans, but that state-of-the-art DNNs trained on ImageNet believe with \(\geq 99.6\)% certainty to be a familiar object. ... Images are either directly (top) or indirectly (bottom) encoded."
As you can tell from that figure, the evolved images look nothing at all like what human beings would think of as examples of those classes.
Results for MNIST are similar: in each plot, we see images classified (with \(\geq 99.9\)% confidence) as the digits given by the columns; the rows are 5 different evolutionary runs, with either direct (white-noise-y panel) or indirect (abstract-art-y panel) encodings.
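A rough numpy sketch of the gradient-ascent route to such fooling images, again using a toy softmax classifier with random weights of my own rather than the ImageNet/MNIST networks in the paper: start from noise and climb the model's confidence in a fixed class, with nothing constraining the result to look like anything to a human.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n_classes = 784, 10

# Toy linear softmax classifier standing in for the trained DNN; the random
# weights just let the code run, they are not a model of the paper's networks.
W = rng.standard_normal((n_classes, d)) * 0.05

def probs(x):
    z = W @ x
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

target = 3                       # the class whose confidence we drive up
x = rng.random(d)                # start from noise, pixels on a 0-1 scale

step = 1.0
for _ in range(1000):
    p = probs(x)
    # gradient of log p[target] with respect to the input pixels
    grad = W.T @ (np.eye(n_classes)[target] - p)
    x = np.clip(x + step * grad, 0.0, 1.0)   # keep pixels in [0, 1]

print("confidence in the target class:", round(float(probs(x)[target]), 4))
```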
- Paper I says that imperceptible-to-humans perturbations make huge differences to neural networks; Paper II says that huge-to-humans perturbations can leave neural networks utterly indifferent. And this is, N.B., on networks that do well under cross-validation on the data.
- Conclusion: Clearly human iso-classification contours are nothing at all like DNN iso-classification contours.
- It's entirely possible that any method which works with a high-dimensional feature space is vulnerable to similar attacks. I'm not sure how we'd do this with human beings as the classifiers (some sort of physiological measure of bogglement in place of the network's numerical confidence?), but it'd be in-principle straightforward enough to do with a common-or-garden support vector machine.
- Desimone et al. (1984), in a now-classic paper in neuroscience, demonstrated that macaques had neurons which selectively responded to faces. They went on to show that at least some of those cells still responded to quite creepy altered faces, with their features re-arranged, blanking bars drawn across them, and even faces of a different species (as in this portion of their Figure 6A):
It'd seem in-principle possible to use the technique of this paper to evolve images for maximal response from these cells, if you could get the monkey to tolerate the apparatus for long enough.
- The basic idea: defining what microscopic, low-level features cause an image to have macroscopic, high-level properties
- This is a counterfactual / interventional definition, not just about probabilistic association
- This gives an in-principle sound way of constraining a learning system to only pay attention to the causally relevant features
- With observational data, you'd need to do causal inference
- They have a very clever way of doing that here, based on a coarsening theorem, which says that the correct causal model is (almost always) a coarsening of the optimally-predictive observational model (a toy numerical illustration follows this list)
- This in turn gets into issues of predictive states...
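The coarsening point is easiest (for me) to see in a toy discrete example, entirely my own construction and much simpler than anything in the paper: an "image" has two binary features, only one of which actually causes the target behavior, while a hidden common cause makes the other feature observationally predictive. The observational partition of images (by \(P(T=1 \mid x)\)) then has four cells, the causal partition (by \(P(T=1 \mid do(x))\)) has two, and each causal cell is a union of observational cells.

```python
from itertools import product

# Toy generative model (all probabilities invented for illustration):
# hidden confounder C influences the spurious image feature S and the behavior T;
# the image feature H causes T directly.  An "image" is the pair x = (H, S).
p_C = {0: 0.5, 1: 0.5}
p_S_given_C = {0: {0: 0.9, 1: 0.1},     # S is a noisy copy of C
               1: {0: 0.1, 1: 0.9}}

def p_T1(h, c):                         # T depends on H (causal) and C (confounder)
    return 0.1 + 0.7 * h + 0.1 * c

def observational(h, s):
    """P(T=1 | X=(h,s)): conditioning on S carries information about C."""
    joint = {c: p_C[c] * p_S_given_C[c][s] for c in (0, 1)}
    z = sum(joint.values())
    return sum(joint[c] / z * p_T1(h, c) for c in (0, 1))

def interventional(h, s):
    """P(T=1 | do(X=(h,s))): setting the image severs the C -> S edge."""
    return sum(p_C[c] * p_T1(h, c) for c in (0, 1))

for h, s in product((0, 1), repeat=2):
    print(f"x=({h},{s})   P(T=1|x) = {observational(h, s):.2f}   "
          f"P(T=1|do(x)) = {interventional(h, s):.2f}")
# Four distinct observational values, but only two interventional ones:
# the causal partition of the images coarsens the observational partition.
```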
I guess a more purely Greek phrase would be "isotaxon", but that's just a guess.
i.e., I learned about this paper from Shallice and Cooper's excellent The Organisation of Mind, but felt dumb for not knowing about it before.
Posted at August 06, 2019 15:17 | permanent link