August 14, 2019

Course Announcement: Data Mining (36-462/662), Fall 2019

For the first time in ten years, I find myself teaching data mining in the fall. This means I need to figure out what data mining is in 2019. Naturally, my first stab at a syllabus is based on what I thought data mining was in 2009. Perhaps it's changed too little; nonetheless, I'm feeling OK with it at the moment*. I am sure the thoughtful and constructive suggestions of the Internet will only reinforce this satisfaction.

--- Seriously, suggestions are welcome, except for suggesting that I teach about neural networks, which I deliberately omitted because I am an out-of-date stick-in-the-mud for reasons**.

*: Though I am not done selecting readings from the textbook, the recommended books, and sundry articles --- those will, however, come before the respective classes. I have been teaching long enough to realize that most students, particularly in a class like this, will read just enough of the most emphatically required material to think they know how to do the assignments. But there are exceptions, and anecdotally even some of that majority come back to the material later and benefit from pointers.

**: On the one hand, CMU (now) has plenty of well-attended classes on neural networks and deep learning, so what would one more add? On the other, my admittedly cranky opinion is that we have no idea why the new crop works better than the 1990s version, and it's not always clear that they do work better than good old-fashioned machine learning, so there.

Posted at August 14, 2019 17:17 | permanent link

August 06, 2019

Notes on "Intriguing Properties of Neural Networks", and two other papers (2014)

$\DeclareMathOperator*{\argmax}{argmax}$

Attention conservation notice: Slides full of bullet points are never good reading; why would you force yourself to read painfully obsolete slides (including even more painfully dated jokes) about a rapidly moving subject?

These are basically the slides I presented at CMU's Statistical Machine Learning Reading Group on 13 November 2014, on the first paper on what have come to be called "adversarial examples". It includes some notes I made after the group meeting on the Q-and-A, but I may not have properly credited (or understood) everyone's contributions even at the time. It also includes some even rougher notes about two relevant papers that came out the next month. Presented now, while I procrastinate on preparing my fall class, in the interest of the historical record.

Paper I: "Intriguing properties of neural networks" (Szegedy et al.)

Background

• Nostalgia for the early 1990s: G. Hinton and company are poised to take over the world, NIPS is mad for neural networks, Clinton is running for President...
• Learning about neural networks for the first time in cog. sci. 1
• Apocrypha: a neural network supposed to distinguish tanks from trucks in aerial photographs actually learned about parking lots...
• The models
• Multilayer perceptron $\phi(x) = (\phi_K \circ \phi_{K-1} \circ \cdots \circ \phi_1)(x)$
• This paper not concerned with training protocol, just following what others have done
• The applications
• MNIST digit-recognition
• ImageNet
• $10^7$ images from YouTube
• So we've got autoencoders, we've got convolutional networks, we've got your favorite architecture and way of training it
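To fix notation, the layer-by-layer composition $\phi = \phi_K \circ \cdots \circ \phi_1$ can be sketched in a few lines of numpy. The layer sizes and the ReLU nonlinearity here are illustrative assumptions on my part, not details from the paper:

```python
# Minimal sketch of a multilayer perceptron as a composition of layers,
# phi = phi_K o ... o phi_1. Shapes and the ReLU nonlinearity are
# illustrative choices, not taken from Szegedy et al.
import numpy as np

rng = np.random.default_rng(0)

def make_layer(n_in, n_out):
    # Each phi_k is an affine map followed by an elementwise nonlinearity
    W = rng.normal(scale=1.0 / np.sqrt(n_in), size=(n_out, n_in))
    b = np.zeros(n_out)
    return lambda x: np.maximum(W @ x + b, 0.0)

# phi_1, ..., phi_K for a toy 784 -> 128 -> 64 -> 10 network
layers = [make_layer(784, 128), make_layer(128, 64), make_layer(64, 10)]

def phi(x):
    for layer in layers:  # apply phi_1 first, phi_K last
        x = layer(x)
    return x

x = rng.normal(size=784)
print(phi(x).shape)  # shape of the final-layer representation
```

The point of writing it this way is just that $\phi(x)$, the last-layer representation, is what the "semantics" arguments below are about.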

Where Are the Semantics?

• Claim (the literature, passim): individual hidden units in the network encode high-level semantic features
• Support for the claim: look at the images which maximize the activation of units in some layer $\mathcal{X}_i = \argmax_{x}{\langle \phi(x), e_i \rangle}$ then do story-telling about what $x \in \mathcal{X}_i$ have in common
• The critique: pick a random unit vector $v$ and do similar story-telling about $\mathcal{X}_{v} = \argmax_{x}{\langle \phi(x), v \rangle}$
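The claim-vs-critique comparison is easy to operationalize: collect the images whose representations $\phi(x)$ have the largest inner product with a basis vector $e_i$, and likewise with a random unit vector $v$. In this sketch the "representations" are random stand-ins and the top-$k$ size is arbitrary; only the selection procedure matches the text:

```python
# Sketch of selecting top-activating images for a coordinate direction
# (the claim) versus a random unit direction (the critique). The
# "representations" here are random stand-ins for phi(x) over a dataset.
import numpy as np

rng = np.random.default_rng(1)
n_images, d = 1000, 50
reps = rng.normal(size=(n_images, d))    # stand-in for phi(x), one row per image

def top_activating(direction, k=8):
    scores = reps @ direction            # <phi(x), direction> for each image
    return np.argsort(scores)[-k:][::-1] # indices of the k largest, descending

e_i = np.zeros(d)
e_i[3] = 1.0                             # a coordinate direction ("unit 3")
v = rng.normal(size=d)
v /= np.linalg.norm(v)                   # a random unit direction

X_i = top_activating(e_i)                # images for the claim's story-telling
X_v = top_activating(v)                  # images for the critique's story-telling
```

The critique is that $\mathcal{X}_v$ invites just as much after-the-fact story-telling as $\mathcal{X}_i$ does.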

• Real units
• Comment on the top left: white flowers?!?

• Randomized pseudo-units

• My assessment of the critique:
• Weakness of both claim and critique: people are very good at finding semantic features that link random objects (e.g., Zhu, Rogers and Gibson, "Human Rademacher Complexity", NIPS 2009, where subjects read semantics like "related to motel service" into random word lists)
• How much semantics would people be able to read into a random collection of training $x$'s of equal size to $\mathcal{X}_i$ or $\mathcal{X}_v$?
• Implications of the critique: This is if anything a vindication for good old-fashioned parallel distributed processing (as in the 1990s...)
• Doesn't matter as engineering...
• Also doesn't matter as a caricature of animal nervous systems: just as there are no grandmother cells in the brain, there is no "white flower" cell in the network, that's actually distributed across the network
• Suggestions from the audience:
• Ryan Tibshirani: maybe the basis vectors are more "prototypical" than the random vectors? Referenced papers by Nina Balcan and by Robert Tibshirani on prototype clustering
• Yu-Xing Wang: What about images corresponding to weights of hidden-layer neurons in convolutional networks? Me: I need to see what's going on in that paper before I can comment...

The Learned Classifier Isn't Perceptually Continuous

• Claim: generalization based on semantics, or at least on features not local in the input space
• Find the smallest perturbation $r$ we can apply to a given $x$ to drive it to the desired class $l$, i.e., smallest $r$ s.t. $f(x+r) = l$
• Robustness: use the same perturbation on a different network (different hyperparameters, different training set) and see whether $f^{\prime}(x+r) = l$ as well
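The perturbation search can be sketched as the paper's relaxation: instead of exactly minimizing $\|r\|$ subject to $f(x+r)=l$, do gradient descent on $c\|r\|^2 + \mathrm{loss}(x+r, l)$. To keep the sketch self-contained I use a toy linear softmax classifier (so the gradient is exact); the classifier, the penalty weight $c$, and the step size are all made-up choices, not the paper's box-constrained L-BFGS setup:

```python
# Toy sketch of the targeted-perturbation search: gradient descent on
# c*||r||^2 + cross-entropy(x + r, target), for a fixed linear softmax
# classifier. All constants here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(2)
d, n_classes = 20, 3
W = rng.normal(size=(n_classes, d))  # a fixed "trained" linear classifier

def f(x):
    return int(np.argmax(W @ x))     # the hard classification f(x)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

x = rng.normal(size=d)
target = (f(x) + 1) % n_classes      # some class l other than f(x)

r = np.zeros(d)
c, lr = 0.01, 0.05
for _ in range(1000):
    p = softmax(W @ (x + r))
    # gradient of the cross-entropy toward `target`, plus the size penalty
    grad = W.T @ (p - np.eye(n_classes)[target]) + 2 * c * r
    r -= lr * grad

print(f(x), f(x + r))  # the perturbed input should now be classified as `target`
```

The robustness check in the last bullet would then re-use this same $r$ on a second classifier $f^{\prime}$ (different weights, different training set) and ask whether $f^{\prime}(x+r) = l$ still holds.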