Adversarial Examples in Machine Learning

Last update: 07 Jul 2025 13:53
First version: 19 October 2018

Yet another inadequate place-holder. I will however point to my talk notes on the first paper on the subject...

This phenomenon is, to my mind, the most interesting thing to have come out of the recent revival of multi-layer connectionist models, a.k.a. "deep learning".

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, Rob Fergus, "Intriguing properties of neural networks", arxiv:1312.6199
Anh Nguyen, Jason Yosinski, Jeff Clune, "Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images", arxiv:1412.1897

Tom B. Brown, Dandelion Mané, Aurko Roy, Martín Abadi, Justin Gilmer, "Adversarial Patch", arxiv:1712.09665
Sébastien Bubeck, Eric Price, Ilya Razenshteyn, "Adversarial examples from computational constraints", arxiv:1805.10204
Brandon Carter, Siddhartha Jain, Jonas Mueller, David Gifford, "Overinterpretation reveals image classification model pathologies", arxiv:2003.08907 [These aren't, strictly speaking, adversarial examples, but a different pathology, showing how to mask over 90% of standard training set images, resulting in completely uninterpretable scatterings of pixels, which standard neural nets still classify with high confidence]
Krzysztof Chalupka, Pietro Perona, Frederick Eberhardt, "Visual Causal Feature Learning", arxiv:1412.2309
Hang Gao, Tim Oates, "Universal Adversarial Perturbation for Text Classification", arxiv:1910.04618
Adam Gleave, Michael Dennis, Neel Kant, Cody Wild, Sergey Levine, Stuart Russell, "Adversarial Policies: Attacking Deep Reinforcement Learning", arxiv:1905.10615
Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy, "Explaining and Harnessing Adversarial Examples", arxiv:1412.6572
Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Logan Engstrom, Brandon Tran, Aleksander Madry, "Adversarial Examples Are Not Bugs, They Are Features", arxiv:1905.02175 [I'm not as convinced as they are that they've managed to create networks using only "robust" features that aren't vulnerable to new adversarial attacks. But I am convinced that they're able to identify non-robust features and show they generalize to the original data set. --- Immediately after reading the paper, I discovered an extensive multi-author discussion, with reply, which I have not had a chance to examine, but link here.]
Jörn-Henrik Jacobsen, Jens Behrmann, Richard Zemel and Matthias Bethge, "Excessive Invariance Causes Adversarial Vulnerability", International Conference on Learning Representations 2019 [This is a good paper, but somewhat astonishingly does not cite Nguyen et al. 2014]
Juncheng Li, Frank R. Schmidt, J. Zico Kolter, "Adversarial camera stickers: A physical camera-based attack on deep learning systems", arxiv:1904.00759
Saeed Mahloujifar, Xiao Zhang, Mohammad Mahmoody, David Evans, "Empirically Measuring Concentration: Fundamental Limits on Intrinsic Robustness", arxiv:1905.12202
Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Omar Fawzi, Pascal Frossard, "Universal adversarial perturbations", arxiv:1610.08401
Hadi Salman, Mingjie Sun, Greg Yang, Ashish Kapoor, J. Zico Kolter, "Denoised Smoothing: A Provable Defense for Pretrained Classifiers", arxiv:2003.01908 [Here's the idea, roughly: Start with a correctly classified image \( x \), so \( m(x) = c(x) \) where \( m(\cdot) \) is the classifier function and \( c(\cdot) \) is the true class. (Let's assume the true class is a function of the image.) An adversarial example would be a small perturbation \( a \) such that \( m(x+a) = d \neq c(x) \). But the adversarial perturbations aren't just small, they're a very particular set, so if we add random noise \( R \) we typically get kicked back out of the adversarial set and back into the pre-image of \( c(x) \), thus \( m(x+a+R) = c(x) \) with high probability. So it's somehow relying on adversarial perturbations being atypical; maybe not topologically "meagre" in the strict sense, but presumably also not a generic set. When this works, it must tell us something about the geometry of the decision boundaries, but I'm not smart enough to say what. I should study this very carefully.]
Adi Shamir, Itay Safran, Eyal Ronen and Orr Dunkelman, "A Simple Explanation for the Existence of Adversarial Examples with Small Hamming Distance", arxiv:1901.1086 [As they are careful to point out, they explain the existence of adversarial examples where a small number of pixels (or other basic features) are perturbed, but perhaps by arbitrarily large amounts. Also, their explanation relies on the over-all network being piecewise linear.]
Rohan Taori, Amog Kamsetty, Brenton Chu, Nikita Vemuri, "Targeted Adversarial Examples for Black Box Audio Systems", arxiv:1805.07820
Andy Zou, Zifan Wang, J. Zico Kolter, Matt Fredrikson, "Universal and Transferable Adversarial Attacks on Aligned Language Models", arxiv:2307.15043 [Demos, etc.]
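
A toy sketch of the noise argument in the note on Salman et al. above, under loudly stated assumptions: the one-dimensional classifier `m`, the thin mislabeled "pocket", the noise scale `sigma`, and the helper `smoothed_m` are all hypothetical and chosen only for illustration. This is not the method of the paper, which works with pretrained image classifiers and a denoiser; it only shows how random noise \( R \) can kick \( x + a \) back out of a small adversarial set.

```python
import numpy as np

def m(x):
    """Toy 1-D classifier (hypothetical): class 1 for x > 0, class 0 otherwise,
    except for a thin pocket inside the class-1 region that is labeled 0."""
    x = np.asarray(x, dtype=float)
    base = (x > 0).astype(int)
    pocket = (x >= 0.50) & (x <= 0.52)   # the "very particular set"
    return np.where(pocket, 0, base)

def smoothed_m(x, sigma=0.1, n_draws=10_000, seed=0):
    """Majority vote of m over Gaussian noise R ~ N(0, sigma^2) added to x."""
    rng = np.random.default_rng(seed)
    R = rng.normal(0.0, sigma, size=n_draws)
    return int(m(x + R).mean() > 0.5)

x = 0.45   # correctly classified point, assumed true class c(x) = 1
a = 0.06   # small perturbation landing inside the pocket

print(int(m(x)))          # classifier on the clean point
print(int(m(x + a)))      # classifier on the perturbed point
print(smoothed_m(x + a))  # noisy majority vote on the perturbed point
```

In this toy the pocket is narrow relative to `sigma`, so most noisy copies of \( x + a \) fall outside it and the vote returns to the original class; if the pocket were wide compared with the noise, the vote would not recover, which is the sense in which the argument leans on adversarial perturbations being atypical.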

CRS, "Notes on 'Intriguing Properties of Neural Networks', and two other papers (2014)" [On Szegedy et al., Nguyen et al., and Chalupka et al.]

Nilesh A. Ahuja, Ibrahima Ndiour, Trushant Kalyanpur, Omesh Tickoo, "Probabilistic Modeling of Deep Features for Out-of-Distribution and Adversarial Detection", arxiv:1909.11786
Ulrich Aïvodji, Sébastien Gambs, Timon Ther, "GAMIN: An Adversarial Approach to Black-Box Model Inversion", arxiv:1909.11835
Devansh Arpit, Caiming Xiong, Richard Socher, "Entropy Penalty: Towards Generalization Beyond the IID Assumption", arxiv:1910.00164
Anish Athalye, Logan Engstrom, Andrew Ilyas, Kevin Kwok, "Synthesizing Robust Adversarial Examples", arxiv:1707.07397
Mikhail Belkin, Daniel Hsu, Partha Mitra, "Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate", arxiv:1806.05161
Aleksandar Bojchevski, Stephan Günnemann, "Adversarial Attacks on Node Embeddings via Graph Poisoning", arxiv:1809.01093
Avishek Joey Bose, Andre Cianflone, William L. Hamilton, "Generalizable Adversarial Attacks Using Generative Models", arxiv:1905.10864
Nicholas Carlini, Ulfar Erlingsson, Nicolas Papernot, "Distribution Density, Tails, and Outliers in Machine Learning: Metrics and Applications", arxiv:1910.13427
Heng Chang, Yu Rong, Tingyang Xu, Wenbing Huang, Honglei Zhang, Peng Cui, Wenwu Zhu, Junzhou Huang, "A Restricted Black-box Adversarial Framework Towards Attacking Graph Embedding Models", arxiv:1908.01297
Gilad Cohen, Guillermo Sapiro, Raja Giryes, "Detecting Adversarial Samples Using Influence Functions and Nearest Neighbors", arxiv:1909.06872
Dimitrios I. Diochnos, Saeed Mahloujifar, Mohammad Mahmoody, "Lower Bounds for Adversarially Robust PAC Learning", arxiv:1906.05815
Ann-Kathrin Dombrowski, Maximilian Alber, Christopher J. Anders, Marcel Ackermann, Klaus-Robert Müller, Pan Kessel, "Explanations can be manipulated and geometry is to blame", arxiv:1906.07983
Gamaleldin F. Elsayed, Ian Goodfellow, Jascha Sohl-Dickstein, "Adversarial Reprogramming of Neural Networks", arxiv:1806.11146
Gamaleldin F. Elsayed, Shreya Shankar, Brian Cheung, Nicolas Papernot, Alex Kurakin, Ian Goodfellow, Jascha Sohl-Dickstein, "Adversarial Examples that Fool both Computer Vision and Time-Limited Humans", arxiv:1802.08195
Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Brandon Tran, Aleksander Madry, "Adversarial Robustness as a Prior for Learned Representations", arxiv:1906.00945
Samuel G. Finlayson, Hyung Won Chung, Isaac S. Kohane, Andrew L. Beam, "Adversarial Attacks Against Medical Deep Learning Systems", arxiv:1804.05296
Matthias Freiberger, Peter Kun, Anders Sundnes Lovlie, Sebastian Risi, "CLIPMasterPrints: Fooling Contrastive Language-Image Pre-training Using Latent Variable Evolution", arxiv:2307.03798
Justin Gilmer, Ryan P. Adams, Ian Goodfellow, David Andersen, George E. Dahl, "Motivating the Rules of the Game for Adversarial Example Research", arxiv:1807.06732 [Not sure how much this matters to me, since I'm not interested in these as security holes so much as windows on to what the networks are doing]
Justin Gilmer, Luke Metz, Fartash Faghri, Samuel S. Schoenholz, Maithra Raghu, Martin Wattenberg, Ian Goodfellow, "Adversarial Spheres", arxiv:1801.02774
Jan Philip Göpfert, André Artelt, Heiko Wersing, Barbara Hammer, "Adversarial attacks hidden in plain sight", arxiv:1902.09286
Melody Y. Guan, Gregory Valiant, "A Surprising Density of Illusionable Natural Speech", arxiv:1906.01040
Chuan Guo, Jacob R. Gardner, Yurong You, Andrew Gordon Wilson, Kilian Q. Weinberger, "Simple Black-box Adversarial Attacks", arxiv:1905.07121
Jiangfan Han, Xiaoyi Dong, Ruimao Zhang, Dongdong Chen, Weiming Zhang, Nenghai Yu, Ping Luo, Xiaogang Wang, "Once a MAN: Towards Multi-Target Attack via Learning Multi-Target Adversarial Network Once", arxiv:1908.05185
Xintian Han, Yuxuan Hu, Luca Foschini, Larry Chinitz, Lior Jankelson, Rajesh Ranganath, "Adversarial Examples for Electrocardiograms", arxiv:1905.05163
Hangfeng He, Weijie J. Su, "The Local Elasticity of Neural Networks", arxiv:1910.06943
Yu-Lun Hsieh, Minhao Cheng, Da-Cheng Juan, Wei Wei, Wen-Lian Hsu, Cho-Jui Hsieh, "Natural Adversarial Sentence Generation with Gradient-based Perturbation", arxiv:1909.04495
Linxi Jiang, Xingjun Ma, Shaoxiang Chen, James Bailey, Yu-Gang Jiang, "Black-box Adversarial Attacks on Video Recognition Models", arxiv:1904.05181
Di Jin, Zhijing Jin, Joey Tianyi Zhou, Peter Szolovits, "Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment", arxiv:1907.11932
Jason Jo, Yoshua Bengio, "Measuring the tendency of CNNs to Learn Surface Statistical Regularities", arxiv:1711.11561
Anthony D. Joseph, Blaine Nelson, Benjamin I. P. Rubinstein and J. D. Tygar, Adversarial Machine Learning [My judgment is that no book on ML has a more visually-apt cover than this one]
Ameya Joshi, Amitangshu Mukherjee, Soumik Sarkar, Chinmay Hegde, "Semantic Adversarial Attacks: Parametric Transformations That Fool Deep Classifiers", arxiv:1904.08489
Karl M. Koerich, Mohammad Esmailpour, Sajjad Abdoli, Alceu S. Britto Jr., Alessandro L. Koerich, "Cross-Representation Transferability of Adversarial Perturbations: From Spectrograms to Audio Waveforms", arxiv:1910.10106
Cassidy Laidlaw, Soheil Feizi, "Functional Adversarial Attacks", arxiv:1906.00001
Alfred Laugros, Alice Caplier, Matthieu Ospici, "Are Adversarial Robustness and Common Perturbation Robustness Independent Attributes?", arxiv:1909.02436
Bai Li, Changyou Chen, Wenlin Wang, Lawrence Carin, "Certified Adversarial Robustness with Additive Gaussian Noise", arxiv:1809.03113 [I'm skeptical, but I haven't read it]
Daniel Liu, Ronald Yu, Hao Su, "Adversarial point perturbations on 3D objects", arxiv:1908.06062
Wenjian Luo, Chenwang Wu, Nan Zhou, Li Ni, "Random Directional Attack for Fooling Deep Neural Networks", arxiv:1908.02658
Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, Adrian Vladu, "Towards Deep Learning Models Resistant to Adversarial Attacks", arxiv:1706.06083
Apostolos Modas, Seyed-Mohsen Moosavi-Dezfooli, Pascal Frossard, "SparseFool: a few pixels make a big difference", arxiv:1811.02248
Paarth Neekhara, Shehzeen Hussain, Shlomo Dubnov, Farinaz Koushanfar, "Adversarial Reprogramming of Text Classification Neural Networks", arxiv:1809.01829
Paarth Neekhara, Shehzeen Hussain, Prakhar Pandey, Shlomo Dubnov, Julian McAuley, Farinaz Koushanfar, "Universal Adversarial Perturbations for Speech Recognition Systems", arxiv:1905.03828
Utku Ozbulak, Arnout Van Messem, Wesley De Neve, "Impact of Adversarial Examples on Deep Learning Models for Biomedical Image Segmentation", arxiv:1907.13124
Daniel Park, Haidar Khan, Bülent Yener, "Generation & Evaluation of Adversarial Examples for Malware Obfuscation", arxiv:1904.04802
Mikhail Pautov, Grigorii Melnikov, Edgar Kaziakhmedov, Klim Kireev, Aleksandr Petiushko, "On adversarial patches: real-world attack on ArcFace-100 face recognition system", arxiv:1910.07067
Dan Peng, Zizhan Zheng, Linhao Luo, Xiaofeng Zhang, "Structure Matters: Towards Generating Transferable Adversarial Images", arxiv:1910.09821
Aram-Alexandre Pooladian, Chris Finlay, Tim Hoheisel, Adam Oberman, "A principled approach for generating adversarial images under non-smooth dissimilarity metrics", arxiv:1908.01667
Shahbaz Rezaei, Xin Liu, "A Target-Agnostic Attack on Deep Models: Exploiting Security Vulnerabilities of Transfer Learning", arxiv:1904.04334
Yaniv Romano, Aviad Aberdam, Jeremias Sulam, Michael Elad, "Adversarial Noise Attacks of Deep Learning Architectures -- Stability Analysis via Sparse Modeled Signals", arxiv:1805.11596
Andras Rozsa, Terrance E. Boult, "Improved Adversarial Robustness by Reducing Open Space Risk via Tent Activations", arxiv:1908.02435
Lea Schönherr, Steffen Zeiler, Thorsten Holz, Dorothea Kolossa, "Robust Over-the-Air Adversarial Examples Against Automatic Speech Recognition Systems", arxiv:1908.01551
Ali Shahin Shamsabadi, Changjae Oh, Andrea Cavallaro, "EdgeFool: An Adversarial Image Enhancement Filter", arxiv:1910.12227
Shawn Shan, Emily Wenger, Bolun Wang, Bo Li, Haitao Zheng, Ben Y. Zhao, "Using Honeypots to Catch Adversarial Attacks on Neural Networks", arxiv:1904.08554
Chaomin Shen, Yaxin Peng, Guixu Zhang, Jinsong Fan, "Defending Against Adversarial Attacks by Suppressing the Largest Eigenvalue of Fisher Information Matrix", arxiv:1909.06137
Xupeng Shi, A. Adam Ding, "Understanding and Quantifying Adversarial Examples Existence in Linear Classification", arxiv:1910.12163
Jacob M. Springer, Melanie Mitchell, Garrett T. Kenyon, "Adversarial Perturbations Are Not So Weird: Entanglement of Robust and Non-Robust Features in Neural Network Classifiers", arxiv:2102.05110
Vinod Subramanian, Emmanouil Benetos, Ning Xu, SKoT McDonald, Mark Sandler, "Adversarial Attacks in Sound Event Classification", arxiv:1907.02477
Finbarr Timbers, Nolan Bard, Edward Lockhart, Marc Lanctot, Martin Schmid, Neil Burch, Julian Schrittwieser, Thomas Hubert, Michael Bowling, "Approximate exploitability: Learning a best response in large games", arxiv:2004.09677
Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, Sameer Singh, "Universal Adversarial Triggers for Attacking and Analyzing NLP", arxiv:1908.07125
Walt Woods, Jack Chen, Christof Teuscher, "Adversarial Explanations for Understanding Image Classification Decisions and Improved Neural Network Robustness", arxiv:1906.02896
Chang Xiao, Peilin Zhong, Changxi Zheng, "Resisting Adversarial Attacks by k-Winners-Take-All", arxiv:1905.10510
Qi Xuan, Jun Zheng, Lihong Chen, Shanqing Yu, Jinyin Chen, Dan Zhang, Qingpeng Zhang, "Unsupervised Euclidean Distance Attack on Network Embedding", arxiv:1905.11015
Jirong Yi, Hui Xie, Leixin Zhou, Xiaodong Wu, Weiyu Xu, Raghuraman Mudumbai, "Trust but Verify: An Information-Theoretic Explanation for the Adversarial Fragility of Machine Learning Systems, and a General Defense against Adversarial Attacks", arxiv:1905.11381
Xuwang Yin, Soheil Kolouri, Gustavo K. Rohde, "Divide-and-Conquer Adversarial Detection", arxiv:1905.11475
Tao Yu, Shengyuan Hu, Chuan Guo, Wei-Lun Chao, Kilian Q. Weinberger, "A New Defense Against Adversarial Images: Turning a Weakness into a Strength", arxiv:1910.07629
Yuan Zang, Chenghao Yang, Fanchao Qi, Zhiyuan Liu, Meng Zhang, Qun Liu, Maosong Sun, "Open the Boxes of Words: Incorporating Sememes into Textual Adversarial Attack", arxiv:1910.12196
Runtian Zhai, Tianle Cai, Di He, Chen Dan, Kun He, John Hopcroft, Liwei Wang, "Adversarially Robust Generalization Just Requires More Unlabeled Data", arxiv:1906.00555
Jiliang Zhang, Chen Li, "Adversarial Examples: Opportunities and Challenges", arxiv:1809.04790 [Review paper]
Pu Zhao, Sijia Liu, Pin-Yu Chen, Nghia Hoang, Kaidi Xu, Bhavya Kailkhura, Xue Lin, "On the Design of Black-box Adversarial Examples by Leveraging Gradient-free Optimization and Operator Splitting Method", arxiv:1907.11684