Text Mining (and Information Retrieval)

Last update: 25 Jun 2025 14:39
First version: September 2013 (or earlier?)

At the intersection of data mining and linguistics.

Actually, "Dr. Internet" Is the Name of the Monsters' Creator
Joint modeling of text and networks
Topic Models get their own notebook
"Attention", "Transformers", in Neural Network "Large Language Models"

David M. Blei and John D. Lafferty, "Topic Models", in A. Srivastava and M. Sahami (eds.), Text Mining: Theory and Applications (2009) [PDF]
David Hand, Heikki Mannila and Padhraic Smyth, Principles of Data Mining [The textbook I teach from; also a book I learned a lot from.]
T. K. Landauer and S. T. Dumais, "A Solution to Plato's Problem: The Latent Semantic Analysis Theory of the Acquisition, Induction and Representation of Knowledge", Psychological Review 104 (1997):211--240

David Dubin, "The Most Influential Paper Gerard Salton Never Wrote"
Alison Gopnik, "What AI Still Doesn't Know How to Do; Artificial intelligence programs that learn to write and speak can sound almost human—but they can't think creatively like a small child can", Wall Street Journal 15 July 2022 [Comments]
Yoav Goldberg, "A Primer on Neural Network Models for Natural Language Processing", Journal of Artificial Intelligence Research 57 (2016): 345--420 [Too early to include "transformers".]
Zubin Jelveh, Bruce Kogut and Suresh Naidu, "Political Language in Economics", SSRN/2535453
Zhiying Jiang, Matthew Yang, Mikhail Tsirlin, Raphael Tang, Yiqin Dai, Jimmy Lin, "'Low-Resource' Text Classification: A Parameter-Free Classification Method with Compressors", Findings of the Association for Computational Linguistics [ACL 2023], pp. 6810--6828
Franco Moretti and Dominique Pestre, "Bankspeak: The Language of World Bank Reports", New Left Review 92 (March-April 2015)
Brendan O'Connor, Statistical Text Analysis for Social Science [Ph.D. thesis, 2014, CMU department of Machine Learning]
Peter D. Turney and Michael L. Littman, "Corpus-based Learning of Analogies and Semantic Relations", arxiv:cs/0508103
Samuel L. Ventura, Rebecca Nugent and Erica R. H. Fuchs, "Methods Matter: Revamping Inventor Disambiguation Algorithms with Classification Models and Labeled Inventor Records", SSRN/2079330

Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer and Richard Harshman, "Indexing by Latent Semantic Analysis", Journal of the American Society for Information Science 41 (1990): 391--407 [PDF reprint]
Susan T. Dumais, Todd A. Letsche, Michael L. Littman and Thomas K. Landauer, "Automatic Cross-Language Retrieval Using Latent Semantic Indexing", AAAI Spring Symposium 1997
George W. Furnas, Scott Deerwester, Susan T. Dumais, Thomas K. Landauer, Richard A. Harshman, Lynn A. Streeter and Karen E. Lochbaum, "Information Retrieval using a Singular Value Decomposition Model of Latent Semantic Structure", pp. 465--480 in Yves Chiaramella (ed.), SIGIR '88 [PDF reprint]

Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean, "Efficient Estimation of Word Representations in Vector Space", arxiv:1301.3781
Marco Baroni, Georgiana Dinu and German Kruszewski, "Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors", pp. 238--247 in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics [ACL14] (Baltimore, MD: Association for Computational Linguistics, 2014)
Yoav Goldberg and Omer Levy, "word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method", arxiv:1402.3722
Jeffrey Pennington, Richard Socher, Christopher Manning, "GloVe: Global Vectors for Word Representation", pp. 1532--1543 in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing [EMNLP 2014]
Kian Kenyon-Dean, "Word Embedding Algorithms as Generalized Low Rank Models and their Canonical Form", arxiv:1911.02639
Lingpeng Kong, Cyprien de Masson d'Autume, Wang Ling, Lei Yu, Zihang Dai, Dani Yogatama, "A Mutual Information Maximization Perspective of Language Representation Learning", arxiv:1910.08350
Rachel Carrington, Karthik Bharath, Simon Preston, "Invariance and identifiability issues for word embeddings", arxiv:1911.02656

My lecture notes for my data mining class

M. E. Maron and J. L. Kuhns, "On Relevance, Probabilistic Indexing and Information Retrieval", Journal of the ACM 7 (1960): 216--244

R. Harald Baayen, Analyzing Linguistic Data: A Practical Introduction to Statistics Using R
Jonathan Bischof, Edoardo Airoldi, "Poisson convolution on a tree of categories for modeling topical content with word frequency and exclusivity", ICML 2012, arxiv:1206.4631
Stefan Büttcher, Charles L. A. Clarke and Gordon V. Cormack, Information Retrieval: Implementing and Evaluating Search Engines
Koby Crammer, Mark Dredze, Fernando Pereira, "Confidence-Weighted Linear Classification for Text Categorization", Journal of Machine Learning Research 13 (2012): 1891--1926
Khalid El-Arini, Emily B. Fox, Carlos Guestrin, "Concept Modeling with Superwords", arxiv:1204.2523
Ronen Feldman and James Sanger, The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data
Andrey Feuerverger, Peter Hall, Gelila Tilahun, Michael Gervers, "Using statistical smoothing to date medieval manuscripts", arxiv:0805.2490
James M. Hughes, Nicholas J. Foti, David C. Krakauer, and Daniel N. Rockmore, "Quantitative patterns of stylistic influence in the evolution of literature", Proceedings of the National Academy of Sciences (USA) 109 (2012): 7682--7686
Martin Klein and Michael L. Nelson, "Approximating Document Frequency with Term Count Values", arxiv:0807.3755 [Approximating inverse document frequency for the web (or other unsurveyable corpora) by term frequency]
Moshe Koppel, Jonathan Schler, Elisheva Bonchek-Dokow, "Measuring Differentiability: Unmasking Pseudonymous Authors", Journal of Machine Learning Research 8 (2007): 1261--1276
Guy Lebanon, "Sequential Document Representations and Simplicial Curves", UAI 2006, arxiv:1206.6858
Guy Lebanon, Yang Zhao, and Yanjun Zhao, "Modeling temporal text streams using the local multinomial model", Electronic Journal of Statistics 4 (2010): 566--584
Jure Leskovec, Lars Backstrom and Jon Kleinberg, "Meme-tracking and the Dynamics of the News Cycle", KDD 2009, pp. 497--506
Yu-Ru Lin, Drew Margolin, Brian Keegan, Andrea Baronchelli, David Lazer, "#Bigbirds Never Die: Understanding Social Dynamics of Emergent Hashtag", ICWSM 2013, arxiv:1303.7144
Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval
Mircea Petrache, Shubhendu Trivedi, "Position Paper: Generalized grammar rules and structure-based generalization beyond classical equivariance for lexical tasks and transduction", arxiv:2402.01629
Daniel M. Romero, Chenhao Tan, Johan Ugander, "Social-Topical Affiliations: The Interplay between Structure and Popularity", arxiv:1112.1115
Julia Silge and David Robinson, Text Mining with R: A Tidy Approach
Jeffrey Solka, "Text Data Mining: Theory and Methods", Statistical Surveys 2 (2008): 94--112 = arxiv:0807.2569
Gelila Tilahun, Andrey Feuerverger, and Michael Gervers, "Dating medieval English charters", Annals of Applied Statistics 6 (2012): 1615--1640, arxiv:1301.2405
D. Volk and M. G. Stepanov, "Resampling methods for document clustering", arxiv:cond-mat/0109006