## Adversarial Examples in Machine Learning

*25 Feb 2024 14:35*

Yet another inadequate place-holder. I will however point to my talk notes on the first paper on the subject...

This phenomenon is, to my mind, the most interesting thing to have come out of the recent revival of multi-layer connectionist models, a.k.a. "deep learning".

- Recommended, most important:
- Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, Rob Fergus, "Intriguing properties of neural networks", arxiv:1312.6199
- Anh Nguyen, Jason Yosinski, Jeff Clune, "Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images", arxiv:1412.1897

- Recommended (unprioritized):
- Tom B. Brown, Dandelion Mané, Aurko Roy, Martín Abadi, Justin Gilmer, "Adversarial Patch", arxiv:1712.09665
- Sébastien Bubeck, Eric Price, Ilya Razenshteyn, "Adversarial examples from computational constraints", arxiv:1805.10204
- Brandon Carter, Siddhartha Jain, Jonas Mueller, David Gifford, "Overinterpretation reveals image classification model pathologies", arxiv:2003.08907 [These aren't, strictly speaking, adversarial examples, but a different pathology, showing how to mask over 90% of standard training set images, resulting in completely uninterpretable scatterings of pixels, which standard neural nets still classify with high confidence]
- Krzysztof Chalupka, Pietro Perona, Frederick Eberhardt, "Visual Causal Feature Learning", arxiv:1412.2309
- Hang Gao, Tim Oates, "Universal Adversarial Perturbation for Text Classification", arxiv:1910.04618
- Adam Gleave, Michael Dennis, Neel Kant, Cody Wild, Sergey Levine, Stuart Russell, "Adversarial Policies: Attacking Deep Reinforcement Learning", arxiv:1905.10615
- Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy, "Explaining and Harnessing Adversarial Examples", arxiv:1412.6572
- Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Logan Engstrom, Brandon Tran, Aleksander Madry, "Adversarial Examples Are Not Bugs, They Are Features", arxiv:1905.02175 [I'm not as convinced as they are that they've managed to create networks using only "robust" features that aren't vulnerable to new adversarial attacks. But I *am* convinced that they're able to identify non-robust features and show they generalize to the original data set. Immediately after reading the paper, I discovered an extensive multi-author discussion, with reply, which I have not had a chance to examine, but link here.]
- Jörn-Henrik Jacobsen, Jens Behrmann, Richard Zemel and Matthias Bethge, "Excessive Invariance Causes Adversarial Vulnerability", International Conference on Learning Representations 2019 [This is a good paper, but somewhat astonishingly does not cite Nguyen et al. 2014]
- Juncheng Li, Frank R. Schmidt, J. Zico Kolter, "Adversarial camera stickers: A physical camera-based attack on deep learning systems", arxiv:1904.00759
- Saeed Mahloujifar, Xiao Zhang, Mohammad Mahmoody, David Evans, "Empirically Measuring Concentration: Fundamental Limits on Intrinsic Robustness", arxiv:1905.12202
- Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Omar Fawzi, Pascal Frossard, "Universal adversarial perturbations", arxiv:1610.08401
- Hadi Salman, Mingjie Sun, Greg Yang, Ashish Kapoor, J. Zico Kolter, "Denoised Smoothing: A Provable Defense for Pretrained Classifiers", arxiv:2003.01908 [Here's the idea, roughly: Start with a correctly classified image \( x \), so \( m(x) = c(x) \) where \( m(\cdot) \) is the classifier function and \( c(\cdot) \) is the true class. (Let's assume the true class *is* a function of the image.) An adversarial example would be a *small* perturbation \( a \) such that \( m(x+a) = d \neq c(x) \). But the adversarial perturbations aren't just small, they're a very particular set, so if we add random noise \( R \) we typically get kicked back out of the adversarial set and back into the pre-image of \( c(x) \), thus \( m(x+a+R) = c(x) \) with high probability. So it's somehow relying on adversarial perturbations being atypical; maybe not topologically "meagre" in the strict sense, but presumably also not a generic set. When this works, it must tell us something about the geometry of the decision boundaries, but I'm not smart enough to say what. I should study this very carefully.]
- Adi Shamir, Itay Safran, Eyal Ronen and Orr Dunkelman, "A Simple Explanation for the Existence of Adversarial Examples with Small Hamming Distance", arxiv:1901.1086 [As they are careful to point out, they explain the existence of adversarial examples where a small *number* of pixels (or other basic features) are perturbed, but perhaps by arbitrarily large amounts. Also, their explanation relies on the over-all network being piecewise linear.]
- Rohan Taori, Amog Kamsetty, Brenton Chu, Nikita Vemuri, "Targeted Adversarial Examples for Black Box Audio Systems", arxiv:1805.07820
- Andy Zou, Zifan Wang, J. Zico Kolter, Matt Fredrikson, "Universal and Transferable Adversarial Attacks on Aligned Language Models", arxiv:2307.15043 [Demos, etc.]
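
The rough idea in the note on Salman et al. above, that \( m(x+a) \neq c(x) \) but \( m(x+a+R) = c(x) \) with high probability, can be seen in a toy setting. What follows is only a sketch under made-up assumptions, not the method of the paper: `toy_classifier` is a hypothetical two-class rule on 2-D points whose "adversarial set" is a deliberately thin slab, and `smoothed_vote`, `sigma`, and `n_samples` are illustrative names and values. The noise only rescues the correct label here because the slab is atypical under the noise distribution, which is exactly the assumption the note worries about.

```python
import random


def toy_classifier(z):
    """Hypothetical classifier m: class 1 only inside a thin slab, class 0 elsewhere."""
    return 1 if abs(z[0] - 0.3) < 0.01 else 0


def smoothed_vote(classifier, z, sigma=0.1, n_samples=2000, seed=0):
    """Majority vote of `classifier` over Gaussian-noised copies of z (m(z + R))."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(n_samples):
        noisy = [zi + rng.gauss(0.0, sigma) for zi in z]
        label = classifier(noisy)
        counts[label] = counts.get(label, 0) + 1
    return max(counts, key=counts.get)


x = [0.0, 0.0]   # clean input; by assumption its true class c(x) is 0
a = [0.3, 0.0]   # perturbation chosen to land inside the thin slab
x_adv = [xi + ai for xi, ai in zip(x, a)]

clean_label = toy_classifier(x)                        # m(x)
adv_label = toy_classifier(x_adv)                      # m(x + a)
smoothed_label = smoothed_vote(toy_classifier, x_adv)  # vote over m(x + a + R)

print(clean_label, adv_label, smoothed_label)
```

With a half-space in place of the slab, and \( x + a \) sitting just across the boundary, the same vote would be close to a coin flip, so the toy says nothing about when real decision boundaries have the needed geometry.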

- Modesty forbids me to recommend:
- CRS, "Notes on 'Intriguing Properties of Neural Networks', and two other papers (2014)" [On Szegedy et al., Nguyen et al., and Chalupka et al.]

- To read (a rather promiscuous mix of examples and stabs at general explanations):
- Nilesh A. Ahuja, Ibrahima Ndiour, Trushant Kalyanpur, Omesh Tickoo, "Probabilistic Modeling of Deep Features for Out-of-Distribution and Adversarial Detection", arxiv:1909.11786
- Ulrich Aïvodji, Sébastien Gambs, Timon Ther, "GAMIN: An Adversarial Approach to Black-Box Model Inversion", arxiv:1909.11835
- Devansh Arpit, Caiming Xiong, Richard Socher, "Entropy Penalty: Towards Generalization Beyond the IID Assumption", arxiv:1910.00164
- Anish Athalye, Logan Engstrom, Andrew Ilyas, Kevin Kwok, "Synthesizing Robust Adversarial Examples", arxiv:1707.07397
- Mikhail Belkin, Daniel Hsu, Partha Mitra, "Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate", arxiv:1806.05161
- Aleksandar Bojchevski, Stephan Günnemann, "Adversarial Attacks on Node Embeddings via Graph Poisoning", arxiv:1809.01093
- Avishek Joey Bose, Andre Cianflone, William L. Hamilton, "Generalizable Adversarial Attacks Using Generative Models", arxiv:1905.10864
- Nicholas Carlini, Ulfar Erlingsson, Nicolas Papernot, "Distribution Density, Tails, and Outliers in Machine Learning: Metrics and Applications", arxiv:1910.13427
- Heng Chang, Yu Rong, Tingyang Xu, Wenbing Huang, Honglei Zhang, Peng Cui, Wenwu Zhu, Junzhou Huang, "A Restricted Black-box Adversarial Framework Towards Attacking Graph Embedding Models", arxiv:1908.01297
- Gilad Cohen, Guillermo Sapiro, Raja Giryes, "Detecting Adversarial Samples Using Influence Functions and Nearest Neighbors", arxiv:1909.06872
- Dimitrios I. Diochnos, Saeed Mahloujifar, Mohammad Mahmoody, "Lower Bounds for Adversarially Robust PAC Learning", arxiv:1906.05815
- Ann-Kathrin Dombrowski, Maximilian Alber, Christopher J. Anders, Marcel Ackermann, Klaus-Robert Müller, Pan Kessel, "Explanations can be manipulated and geometry is to blame", arxiv:1906.07983
- Gamaleldin F. Elsayed, Ian Goodfellow, Jascha Sohl-Dickstein, "Adversarial Reprogramming of Neural Networks", arxiv:1806.11146
- Gamaleldin F. Elsayed, Shreya Shankar, Brian Cheung, Nicolas Papernot, Alex Kurakin, Ian Goodfellow, Jascha Sohl-Dickstein, "Adversarial Examples that Fool both Computer Vision and Time-Limited Humans", arxiv:1802.08195
- Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Brandon Tran, Aleksander Madry, "Adversarial Robustness as a Prior for Learned Representations", arxiv:1906.00945
- Samuel G. Finlayson, Hyung Won Chung, Isaac S. Kohane, Andrew L. Beam, "Adversarial Attacks Against Medical Deep Learning Systems", arxiv:1804.05296
- Matthias Freiberger, Peter Kun, Anders Sundnes Lovlie, Sebastian Risi, "CLIPMasterPrints: Fooling Contrastive Language-Image Pre-training Using Latent Variable Evolution", arxiv:2307.03798
- Justin Gilmer, Ryan P. Adams, Ian Goodfellow, David Andersen, George E. Dahl, "Motivating the Rules of the Game for Adversarial Example Research", arxiv:1807.06732 [Not sure how much this matters to me, since I'm not interested in these as security holes so much as windows on to what the networks are doing]
- Justin Gilmer, Luke Metz, Fartash Faghri, Samuel S. Schoenholz, Maithra Raghu, Martin Wattenberg, Ian Goodfellow, "Adversarial Spheres", arxiv:1801.02774
- Jan Philip Göpfert, André Artelt, Heiko Wersing, Barbara Hammer, "Adversarial attacks hidden in plain sight", arxiv:1902.09286
- Melody Y. Guan, Gregory Valiant, "A Surprising Density of Illusionable Natural Speech", arxiv:1906.01040
- Chuan Guo, Jacob R. Gardner, Yurong You, Andrew Gordon Wilson, Kilian Q. Weinberger, "Simple Black-box Adversarial Attacks", arxiv:1905.07121
- Jiangfan Han, Xiaoyi Dong, Ruimao Zhang, Dongdong Chen, Weiming Zhang, Nenghai Yu, Ping Luo, Xiaogang Wang, "Once a MAN: Towards Multi-Target Attack via Learning Multi-Target Adversarial Network Once", arxiv:1908.05185
- Xintian Han, Yuxuan Hu, Luca Foschini, Larry Chinitz, Lior Jankelson, Rajesh Ranganath, "Adversarial Examples for Electrocardiograms", arxiv:1905.05163
- Hangfeng He, Weijie J. Su, "The Local Elasticity of Neural Networks", arxiv:1910.06943
- Yu-Lun Hsieh, Minhao Cheng, Da-Cheng Juan, Wei Wei, Wen-Lian Hsu, Cho-Jui Hsieh, "Natural Adversarial Sentence Generation with Gradient-based Perturbation", arxiv:1909.04495
- Linxi Jiang, Xingjun Ma, Shaoxiang Chen, James Bailey, Yu-Gang Jiang, "Black-box Adversarial Attacks on Video Recognition Models", arxiv:1904.05181
- Di Jin, Zhijing Jin, Joey Tianyi Zhou, Peter Szolovits, "Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment", arxiv:1907.11932
- Jason Jo, Yoshua Bengio, "Measuring the tendency of CNNs to Learn Surface Statistical Regularities", arxiv:1711.11561
- Anthony D. Joseph, Blaine Nelson, Benjamin I. P. Rubinstein and J. D. Tygar, Adversarial Machine Learning [My judgment is that no book on ML has a more visually-apt cover than this one]
- Ameya Joshi, Amitangshu Mukherjee, Soumik Sarkar, Chinmay Hegde, "Semantic Adversarial Attacks: Parametric Transformations That Fool Deep Classifiers", arxiv:1904.08489
- Karl M. Koerich, Mohammad Esmailpour, Sajjad Abdoli, Alceu S. Britto Jr., Alessandro L. Koerich, "Cross-Representation Transferability of Adversarial Perturbations: From Spectrograms to Audio Waveforms", arxiv:1910.10106
- Cassidy Laidlaw, Soheil Feizi, "Functional Adversarial Attacks", arxiv:1906.00001
- Alfred Laugros, Alice Caplier, Matthieu Ospici, "Are Adversarial Robustness and Common Perturbation Robustness Independent Attributes?", arxiv:1909.02436
- Bai Li, Changyou Chen, Wenlin Wang, Lawrence Carin, "Certified Adversarial Robustness with Additive Gaussian Noise", arxiv:1809.03113 [I'm skeptical, but I haven't read it]
- Daniel Liu, Ronald Yu, Hao Su, "Adversarial point perturbations on 3D objects", arxiv:1908.06062
- Wenjian Luo, Chenwang Wu, Nan Zhou, Li Ni, "Random Directional Attack for Fooling Deep Neural Networks", arxiv:1908.02658
- Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, Adrian Vladu, "Towards Deep Learning Models Resistant to Adversarial Attacks", arxiv:1706.06083
- Apostolos Modas, Seyed-Mohsen Moosavi-Dezfooli, Pascal Frossard, "SparseFool: a few pixels make a big difference", arxiv:1811.02248
- Paarth Neekhara, Shehzeen Hussain, Shlomo Dubnov, Farinaz Koushanfar, "Adversarial Reprogramming of Text Classification Neural Networks", arxiv:1809.01829
- Paarth Neekhara, Shehzeen Hussain, Prakhar Pandey, Shlomo Dubnov, Julian McAuley, Farinaz Koushanfar, "Universal Adversarial Perturbations for Speech Recognition Systems", arxiv:1905.03828
- Utku Ozbulak, Arnout Van Messem, Wesley De Neve, "Impact of Adversarial Examples on Deep Learning Models for Biomedical Image Segmentation", arxiv:1907.13124
- Daniel Park, Haidar Khan, Bülent Yener, "Generation & Evaluation of Adversarial Examples for Malware Obfuscation", arxiv:1904.04802
- Mikhail Pautov, Grigorii Melnikov, Edgar Kaziakhmedov, Klim Kireev, Aleksandr Petiushko, "On adversarial patches: real-world attack on ArcFace-100 face recognition system", arxiv:1910.07067
- Dan Peng, Zizhan Zheng, Linhao Luo, Xiaofeng Zhang, "Structure Matters: Towards Generating Transferable Adversarial Images", arxiv:1910.09821
- Aram-Alexandre Pooladian, Chris Finlay, Tim Hoheisel, Adam Oberman, "A principled approach for generating adversarial images under non-smooth dissimilarity metrics", arxiv:1908.01667
- Shahbaz Rezaei, Xin Liu, "A Target-Agnostic Attack on Deep Models: Exploiting Security Vulnerabilities of Transfer Learning", arxiv:1904.04334
- Yaniv Romano, Aviad Aberdam, Jeremias Sulam, Michael Elad, "Adversarial Noise Attacks of Deep Learning Architectures -- Stability Analysis via Sparse Modeled Signals", arxiv:1805.11596
- Andras Rozsa, Terrance E. Boult, "Improved Adversarial Robustness by Reducing Open Space Risk via Tent Activations", arxiv:1908.02435
- Lea Schönherr, Steffen Zeiler, Thorsten Holz, Dorothea Kolossa, "Robust Over-the-Air Adversarial Examples Against Automatic Speech Recognition Systems", arxiv:1908.01551
- Ali Shahin Shamsabadi, Changjae Oh, Andrea Cavallaro, "EdgeFool: An Adversarial Image Enhancement Filter", arxiv:1910.12227
- Shawn Shan, Emily Wenger, Bolun Wang, Bo Li, Haitao Zheng, Ben Y. Zhao, "Using Honeypots to Catch Adversarial Attacks on Neural Networks", arxiv:1904.08554
- Chaomin Shen, Yaxin Peng, Guixu Zhang, Jinsong Fan, "Defending Against Adversarial Attacks by Suppressing the Largest Eigenvalue of Fisher Information Matrix", arxiv:1909.06137
- Xupeng Shi, A. Adam Ding, "Understanding and Quantifying Adversarial Examples Existence in Linear Classification", arxiv:1910.12163
- Jacob M. Springer, Melanie Mitchell, Garrett T. Kenyon, "Adversarial Perturbations Are Not So Weird: Entanglement of Robust and Non-Robust Features in Neural Network Classifiers", arxiv:2102.05110
- Vinod Subramanian, Emmanouil Benetos, Ning Xu, SKoT McDonald, Mark Sandler, "Adversarial Attacks in Sound Event Classification", arxiv:1907.02477
- Finbarr Timbers, Nolan Bard, Edward Lockhart, Marc Lanctot, Martin Schmid, Neil Burch, Julian Schrittwieser, Thomas Hubert, Michael Bowling, "Approximate exploitability: Learning a best response in large games", arxiv:2004.09677
- Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, Sameer Singh, "Universal Adversarial Triggers for Attacking and Analyzing NLP", arxiv:1908.07125
- Walt Woods, Jack Chen, Christof Teuscher, "Adversarial Explanations for Understanding Image Classification Decisions and Improved Neural Network Robustness", arxiv:1906.02896
- Chang Xiao, Peilin Zhong, Changxi Zheng, "Resisting Adversarial Attacks by k-Winners-Take-All", arxiv:1905.10510
- Qi Xuan, Jun Zheng, Lihong Chen, Shanqing Yu, Jinyin Chen, Dan Zhang, Qingpeng Zhang, "Unsupervised Euclidean Distance Attack on Network Embedding", arxiv:1905.11015
- Jirong Yi, Hui Xie, Leixin Zhou, Xiaodong Wu, Weiyu Xu, Raghuraman Mudumbai, "Trust but Verify: An Information-Theoretic Explanation for the Adversarial Fragility of Machine Learning Systems, and a General Defense against Adversarial Attacks", arxiv:1905.11381
- Xuwang Yin, Soheil Kolouri, Gustavo K. Rohde, "Divide-and-Conquer Adversarial Detection", arxiv:1905.11475
- Tao Yu, Shengyuan Hu, Chuan Guo, Wei-Lun Chao, Kilian Q. Weinberger, "A New Defense Against Adversarial Images: Turning a Weakness into a Strength", arxiv:1910.07629
- Yuan Zang, Chenghao Yang, Fanchao Qi, Zhiyuan Liu, Meng Zhang, Qun Liu, Maosong Sun, "Open the Boxes of Words: Incorporating Sememes into Textual Adversarial Attack", arxiv:1910.12196
- Runtian Zhai, Tianle Cai, Di He, Chen Dan, Kun He, John Hopcroft, Liwei Wang, "Adversarially Robust Generalization Just Requires More Unlabeled Data", arxiv:1906.00555
- Jiliang Zhang, Chen Li, "Adversarial Examples: Opportunities and Challenges", arxiv:1809.04790 [Review paper]
- Pu Zhao, Sijia Liu, Pin-Yu Chen, Nghia Hoang, Kaidi Xu, Bhavya Kailkhura, Xue Lin, "On the Design of Black-box Adversarial Examples by Leveraging Gradient-free Optimization and Operator Splitting Method", arxiv:1907.11684