Text Mining (and Information Retrieval)
01 Apr 2024 10:18
At the intersection of data mining and linguistics.
- See also:
- Actually, "Dr. Internet" Is the Name of the Monsters' Creator
- Joint modeling of text and networks
- Topic Models get their own notebook
- "Attention", "Transformers", in Neural Network "Large Language Models"
- Recommended, big picture:
- David M. Blei and John D. Lafferty, "Topic Models", in A. Srivastava and M. Sahami (eds.), Text Mining: Theory and Applications (2009) [PDF]
- David Hand, Heikki Mannila and Padhraic Smyth, Principles of Data Mining [The textbook I teach from; also a book I learned a lot from.]
- T. K. Landauer and S. T. Dumais, "A Solution to Plato's Problem: The Latent Semantic Analysis Theory of the Acquisition, Induction and Representation of Knowledge", Psychological Review 104 (1997):211--240
- Recommended, close-ups:
- David Dubin, "The Most Influential Paper Gerard Salton Never Wrote"
- Alison Gopnik, "What AI Still Doesn't Know How to Do; Artificial intelligence programs that learn to write and speak can sound almost human—but they can't think creatively like a small child can", Wall Street Journal 15 July 2022 [Comments]
- Yoav Goldberg, "A Primer on Neural Network Models for Natural Language Processing", Journal of Artificial Intelligence Research 57 (2016): 345--420 [Too early to include "transformers".]
- Zubin Jelveh, Bruce Kogut and Suresh Naidu, "Political Language in Economics", ssrn/2535453
- Zhiying Jiang, Matthew Yang, Mikhail Tsirlin, Raphael Tang, Yiqin Dai, Jimmy Lin, "'Low-Resource' Text Classification: A Parameter-Free Classification Method with Compressors", Findings of the Association for Computational Linguistics [ACL 2023], pp. 6810--6828
- Franco Moretti and Dominique Pestre, "Bankspeak: The Language of World Bank Reports", New Left Review 92 (March-April 2015)
- Brendan O'Connor, Statistical Text Analysis for Social Science [Ph.D. thesis, 2014, CMU department of Machine Learning]
- Peter D. Turney and Michal L. Littman, "Corpus-based Learning of Analogies and Semantic Relations", arxiv:cs/0508103
- Samuel L. Ventura, Rebecca Nugent and Erica R. H. Fuchs, "Methods Matter: Revamping Inventor Disambiguation Algorithms with Classification Models and Labeled Inventor Records", SSRN/2079330
- Recommended, on latent-semantic indexing:
- Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer and Richard Harshman, "Indexing by Latent Semantic Analysis", Journal of the American Society for Information Science 41 (1990): 391--407 [PDF reprint]
- Susan T. Dumais, Todd A. Letsche, Michael L. Littman and Thomas K. Landauer, "Automatic Cross-Language Retrieval Using Latent Semantic Indexing", AAAI Spring Symposium 1997
- George W. Furnas, Scott Deerwester, Susan T. Dumais, Thomas K. Landauer, Richard A. Harshman, Lynn A. Streeter and Karen E. Lochbaum, "Information Retrieval using a Singular Value Decomposition Model of Latent Semantic Structure", pp. 465--480 in Yves Chiaramella (ed.), SIGIR '88 [PDF reprint]
- Recommended, on new-school vector embeddings of words using neural networks (departure from my usual alphabetical order deliberate):
- Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean, "Efficient Estimation of Word Representations in Vector Space", arxiv:1301.3781
- Marco Baroni, Georgiana Dinu and German Kruszewski, "Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors", pp. 238--247 in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics [ACL14] (Baltimore, MD: Association for Computational Linguistics, 2014)
- Yoav Goldberg and Omer Levy, "word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method", arxiv:1402.3722
- Jeffrey Pennington, Richard Socher, Christopher Manning, "GloVe: Global Vectors for Word Representation", pp. 1532--1543 in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing [EMNLP 2014]
- Kian Kenyon-Dean, "Word Embedding Algorithms as Generalized Low Rank Models and their Canonical Form", arxiv:1911.02639
- Lingpeng Kong, Cyprien de Masson d'Autume, Wang Ling, Lei Yu, Zihang Dai, Dani Yogatama, "A Mutual Information Maximization Perspective of Language Representation Learning", arxiv:1910.08350
- Rachel Carrington, Karthik Bharath, Simon Preston, "Invariance and identifiability issues for word embeddings", arxiv:1911.02656
- Modesty forbids me to recommend:
- My lecture notes for my data mining class
- To read:
- R. Harald Baayen, Analyzing Linguistic Data: A Practical Introduction to Statistics Using R
- Jonathan Bischof, Edoardo Airoldi, "Poisson convolution on a tree of categories for modeling topical content with word frequency and exclusivity", ICML 2012, arxiv:1206.4631
- Stefan Büttcher, Charles L. A. Clarke and Gordon V. Cormack, Information Retrieval: Implementing and Evaluating Search Engines
- Koby Crammer, Mark Dredze, Fernando Pereira, "Confidence-Weighted Linear Classification for Text Categorization", Journal of Machine Learning Research 13 (2012): 1891--1926
- Khalid El-Arini, Emily B. Fox, Carlos Guestrin, "Concept Modeling with Superwords", arxiv:1204.2523
- Ronen Feldman and James Sanger, The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data
- Andrey Feuerverger, Peter Hall, Gelila Tilahun, Michael Gervers, "Using statistical smoothing to date medieval manuscripts", arxiv:0805.2490
- James M. Hughes, Nicholas J. Foti, David C. Krakauer, and Daniel N. Rockmore, "Quantitative patterns of stylistic influence in the evolution of literature", Proceedings of the National Academy of Sciences (USA) 109 (2012): 7682--7686
- Martin Klein and Michael L. Nelson, "Approximating Document Frequency with Term Count Values", arxiv:0807.3755 [Approximating inverse document frequency for the web (or other unsurveyable corpora) by term frequency]
- Moshe Koppel, Jonathan Schler, Elisheva Bonchek-Dokow, "Measuring Differentiability: Unmasking Pseudonymous Authors", Journal of Machine Learning Research 8 (2007): 1261--1276
- Guy Lebanon, "Sequential Document Representations and Simplicial Curves", UAI 2006, arxiv:1206.6858
- Guy Lebanon, Yang Zhao, and Yanjun Zhao, "Modeling temporal text streams using the local multinomial model", Electronic Journal of Statistics 4 (2010): 566--584
- Jure Leskovec, Lars Backstrom and Jon Kleinberg, "Meme-tracking and the Dynamics of the News Cycle", KDD 2009, pp. 497--506
- Yu-Ru Lin, Drew Margolin, Brian Keegan, Andrea Baronchelli, David Lazer, "#Bigbirds Never Die: Understanding Social Dynamics of Emergent Hashtag", ICWSM 2013, arxiv:1303.7144
- Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval
- Daniel M. Romero, Chenhao Tan, Johan Ugander, "Social-Topical Affiliations: The Interplay between Structure and Popularity", arxiv:1112.1115
- Julia Silge and David Robinson, Text Mining with R: A Tidy Approach
- Jeffrey Solka, "Text Data Mining: Theory and Methods", Statistical Surveys 2 (2008): 94--112 = arxiv:0807.2569
- Gelila Tilahun, Andrey Feuerverger, and Michael Gervers, "Dating medieval English charters", Annals of Applied Statistics 6 (2012): 1615--1640, arxiv:1301.2405
- D. Volk and M. G. Stepanov, "Resampling methods for document clustering", arxiv:cond-mat/0109006