Cosma's Notebooks
http://bactra.org/notebooks

Grammatical Inference
http://bactra.org/notebooks/2018/10/19#grammatical-inference
<P>Meaning: inferring the rules of a formal language (its grammar) from examples,
positive or negative. I'm mostly interested in the positive case, since
I want to describe physical processes as though they were formal languages
or (what is equivalent) automata.
<P>— A result which has recently come to vex me is the fact that even
finite automata are, in the computational learning theory sense, hard to learn.
It's widely believed, but not proved, that common cryptographic systems are
hard to crack, meaning that there are no efficient, polynomial-time algorithms
for breaking them (unless you have the key, of course...). It turns out that the
ability to learn polynomially-sized deterministic finite automata in polynomial
time would imply the ability to defeat RSA in polynomial time, which is a
pretty good indication that it can't be done. (See Kearns and Vazirani for a
very nice discussion.) Initial results were about uniform distributions over
words, but it turns out further that this holds even when the distribution of
words is generated by a stochastic automaton of the same form.
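<P>To make the combinatorial difficulty concrete, here is a minimal Python sketch (my own illustration, not an algorithm from any of the sources below) that finds the smallest DFA consistent with labeled example strings by brute force. For k states over a binary alphabet there are k<sup>2k</sup> transition tables times 2<sup>k</sup> accepting-state sets to try, which is why nothing like this scales, and why the hardness results above have teeth.

```python
from itertools import product

def consistent_dfa(positives, negatives, max_states=3, alphabet=(0, 1)):
    """Brute-force search for the smallest DFA over `alphabet` that accepts
    every word in `positives` and rejects every word in `negatives`.
    Illustrative only: the candidate space for k states has
    k**(k*len(alphabet)) transition tables times 2**k accepting sets."""
    def accepts(delta, accepting, word):
        state = 0                     # state 0 is always the start state
        for symbol in word:
            state = delta[(state, symbol)]
        return state in accepting

    for k in range(1, max_states + 1):
        keys = [(q, a) for q in range(k) for a in alphabet]
        # every assignment of next-states to (state, symbol) pairs
        for targets in product(range(k), repeat=len(keys)):
            delta = dict(zip(keys, targets))
            # every choice of accepting-state set
            for bits in product((False, True), repeat=k):
                accepting = {q for q in range(k) if bits[q]}
                if all(accepts(delta, accepting, w) for w in positives) and \
                   not any(accepts(delta, accepting, w) for w in negatives):
                    return k, delta, accepting
    return None

# A sample from "even number of 1s": a 2-state DFA suffices, 1 state does not
pos = [(), (0,), (1, 1), (0, 1, 1, 0)]
neg = [(1,), (1, 0), (0, 0, 1)]
print(consistent_dfa(pos, neg))
```

Even on this toy sample the search visits every machine of each size in turn; the cryptographic results say that, in the worst case, no cleverness avoids paying something like this exponential price.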
<P>This is extremely annoying to me, because I want to learn stochastic
automata from time series, and to do so in polynomial time (if not better).
One possibility is that the time-complexity is just bad, in the worst case, and
there is nothing to be done about this. (I fear this might be true.) This
would not, however, address the scaling of prediction error and confidence with
sample size, which is really what interests me. The other possibility is that
I am interested in a rather different set-up than this, one where it's crucial
that, as <em>n</em> grows, we see the continuation of a single sample path from
a fixed automaton. (In the language, every allowed word is the prefix of
infinitely many other allowed words.) Or: the experiment can go on forever.
(In fact I'm often interested in the situation where the experiment could have
been running forever, so every word is the suffix of infinitely many words.) I
think, though I am not sure, that the ability to infer these languages quickly
would not cause cryptographic horrors, because I think this restriction breaks
the crypto-to-DFA mapping.
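<P>A minimal sketch of the single-sample-path setting, under the drastic simplifying assumption that the state is just the last symbol emitted (i.e., a first-order Markov chain, the most trivial stochastic automaton). The point is only that one ever-lengthening realization from a fixed machine lets the empirical transition frequencies converge; all names and parameter values here are my own invention.

```python
import random

def estimate_transitions(path, alphabet=(0, 1)):
    """Estimate transition probabilities of a stochastic automaton whose
    state is the previous symbol, from a single long realization `path`."""
    counts = {(s, t): 0 for s in alphabet for t in alphabet}
    for prev, nxt in zip(path, path[1:]):
        counts[(prev, nxt)] += 1
    probs = {}
    for s in alphabet:
        total = sum(counts[(s, t)] for t in alphabet)
        for t in alphabet:
            probs[(s, t)] = counts[(s, t)] / total if total else 0.0
    return probs

# One sample path from a fixed two-state chain, then re-estimation from it
random.seed(0)
true_p = {0: 0.9, 1: 0.4}            # P(next symbol = 1 | current state)
state, path = 0, []
for _ in range(100_000):
    path.append(state)
    state = 1 if random.random() < true_p[state] else 0
est = estimate_transitions(path)
print(round(est[(0, 1)], 2), round(est[(1, 1)], 2))
```

The interesting question, which this sketch deliberately dodges, is what happens when the states are hidden and must themselves be inferred, and how the error scales with the length of the one path we get to see.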
<P>See also: <a href="learning-theory.html">Computational
Learning Theory</a>;
<a href="computational-mechanics.html">Computational Mechanics</a>;
<a href="linguistics.html">Linguistics</a>;
<a href="learning-inference-induction.html">Machine Learning,
Statistical Inference and Induction</a>;
<a href="transducers.html">Transducers</a>
<ul>Recommended (big picture):
<li>Remo Badii and Antonio Politi, <cite>Complexity: Hierarchical
Structure and Scaling in Physics</cite> [Treating dynamical systems like
formal languages. <a href="../reviews/badii-and-politi/">Review</a>.]
<li>Eugene Charniak, <cite>Statistical Language Learning</cite> [Good
stuff on learning grammars for the two lowest levels of the Chomsky hierarchy;
explains the grammar ideas for the benefit of engineers.]
<li>Colin de la Higuera, <cite>Grammatical Inference: Learning Automata
and Grammars</cite>
[<a href="../weblog/algae-2011-03.html#de-la-higuera">Mini-review</a>]
<li>Michael J. Kearns and Umesh V. Vazirani, <cite>An Introduction to
Computational Learning Theory</cite> [<a
href="../reviews/kearns-vazirani/">Review: How to Build a Better Guesser</a>]
<li>Christopher D. Manning and Hinrich Schütze, <cite>Foundations of
Statistical Natural Language Processing</cite>
</ul>
<ul>Recommended (close-ups):
<li>P. Dupont, F. Denis and Y. Esposito, "Links between probabilistic
automata and hidden Markov models: probability distributions, learning models
and induction algorithms", <a
href="http://dx.doi.org/10.1016/j.patcog.2004.03.020"><cite>Pattern
Recognition</cite> <strong>38</strong> (2005): 1349--1371</a>
<li>Jim Engle-Warnick, William J. McCausland and John H. Miller,
"The Ghost in the Machine: Inferring Machine-Based Strategies from
Observed Behavior" [i.e., inferring stochastic transducers from data; hence
the inclusion here]
<li>Craig G. Nevill-Manning and Ian H. Witten, "Identifying
Hierarchical Structure in Sequences: a Linear-Time Algorithm,"
<a href="http://arxiv.org/abs/cs.AI/9709102">arxiv:cs.AI/9709102</a>
[Scheme for
inferring context-free grammars from sequential data streams; no consideration
of probabilistic properties]
<li>Leonid Peshkin, "Structure induction by lossless graph compression",
<a href="http://arxiv.org/abs/cs.DS/0703132">arxiv:cs.DS/0703132</a>
[Adapting
data-compression ideas, a la Nevill-Manning and Witten, to graphs]
<li>V. I. Propp, <cite><a href="../weblog/algae-2018-08.html#propp">Morphology of the Folktale</a></cite> [Inducing
a regular grammar from the plots of Russian fairytales --- in the 1920s. Further details and comments at the link.]
<li>Patrick Suppes, <cite>Representation and Invariance of Scientific
Structures</cite> [Provides a very counter-intuitive proof, originally
presented in papers from the 1960s and 1970s, that certain stimulus-response
learning models can, asymptotically, become isomorphic to arbitrary grammars.]
<li>Sebastiaan A. Terwijn, "On the Learnability of Hidden Markov
Models", <a href="http://dx.doi.org/10.1007/3-540-45790-9">pp. 344--348 in
P. Adriaans, H. Fernau and M. van Zaanen (eds.), <cite>Grammatical Inference:
Algorithms and Applications</cite>, Lecture Notes in Computer
Science <strong>2484</strong> (2002)</a>
[Straightforward, but it leans a lot on wanting to learn languages over words of fixed length, whereas for the cases of "physical", i.e., dynamical, interest, one is restricted to languages where every word has at least one one-letter extension in the language (and so, by induction, every finite word is the prefix of infinitely many words).]
</ul>
<ul>To read:
<li>Hendrik Blockeel, Robert Brijder, "Non-Confluent NLC Graph Grammar Inference by Compressing Disjoint Subgraphs", <a href="http://arxiv.org/abs/0901.4876">arxiv:0901.4876</a>
<li>Andreas Blume, "A Learning-Efficiency Explanation of Structure in
Language", <a href="http://dx.doi.org/10.1007/s11238-005-0280-1"><cite>Theory
and Decision</cite> <strong>57</strong> (2004): 265--285</a>
<li>Miguel Bugalho and Arlindo L. Oliveira, "Inference
of regular languages using state merging algorithms with search",
<a href="http://dx.doi.org/10.1016/j.patcog.2004.03.027"><cite>Pattern
Recognition</cite> <strong>38</strong> (2005): 1457--1467</a>
<li>Francisco Casacuberta, Enrique Vidal and David Picó,
"Inference of finite-state transducers from regular languages",
<a href="http://dx.doi.org/10.1016/j.patcog.2004.03.025"><cite>Pattern
Recognition</cite> <strong>38</strong> (2005): 1431--1443</a> ["Given a
training corpus of input-output pairs of sentences, the proposed approach uses
statistical alignment methods to produce a set of conventional strings from
which a stochastic finite-state grammar is inferred. This grammar is finally
transformed into a resulting finite-state transducer."]
<li>Alexander Clark, Christophe Costa Florencio and Chris
Watkins, "Languages as hyperplanes: grammatical inference with string
kernels", <a href="http://dx.doi.org/10.1007/s10994-010-5218-3"><cite>Machine Learning</cite>
<strong>82</strong> (2011): 351--373</a>
<li>Alexander Clark, Rémi Eyraud, Amaury Habrard, "Using Contextual Representations to Efficiently Learn Context-Free Languages", <a href="http://jmlr.csail.mit.edu/papers/v11/clark10a.html"><cite>Journal of Machine Learning Research</cite> <strong>11</strong> (2010): 2707--2744</a>
<li>Shay B. Cohen and Noah A. Smith
<ul>
<li>"Empirical Risk Minimization with Approximations of Probabilistic Grammars", NIPS 23 (2010) [<a href="http://books.nips.cc/papers/files/nips23/NIPS2010_0803.pdf">PDF</a>]
<li>"Covariance in Unsupervised Learning of Probabilistic Grammars", <a href="http://jmlr.csail.mit.edu/papers/v11/cohen10a.html"><cite>Journal of Machine Learning Research</cite> <strong>11</strong> (2010): 3017--3051</a>
</ul>
<li>Trevor Cohn, Phil Blunsom, Sharon Goldwater, "Inducing Tree-Substitution Grammars", <a href="http://jmlr.csail.mit.edu/papers/v11/cohn10b.html"><cite>Journal of Machine Learning Research</cite> <strong>11</strong> (2010): 3053--3096</a>
<li>P. Collet, A. Galves and A. Lopes, "Maximum Likelihood and Minimum
Entropy Identification of Grammars," <cite>Random and Computational
Dynamics</cite> <strong>3</strong> (1995): 241--250
<li>Colin de la Higuera, "A bibliographical study of grammatical
inference", <a
href="http://dx.doi.org/10.1016/j.patcog.2005.01.003"><cite>Pattern
Recognition</cite> <strong>38</strong> (2005): 1332--1348</a>
<li>C. de la Higuera and J. C. Janodet, "Inference of &omega;-languages
from prefixes", <cite>Theoretical Computer Science</cite> <strong>313</strong>
(2004): 295--312 [<a
href="http://dx.doi.org/10.1016/j.tcs.2003.11.009">abstract</a>]
<li>Francois Denis, Aurelien Lemay and Alain Terlutte, "Learning
regular languages using RFSAs", <cite>Theoretical Computer Science</cite>
<strong>313</strong> (2004): 267--294 [<a
href="http://dx.doi.org/10.1016/j.tcs.2003.11.008">abstract</a>]
<li>Francois Denis, Yann Esposito and Amaury Habrard, "Learning
rational stochastic
languages", <a href="http://arxiv.org/abs/cs.LG/0602062">cs.LG/0602062</a>
<li>Jeroen Geertzen, "String Alignment in Grammatical Inference: what
suffix trees can do"
[<a href="http://ilk.uvt.nl/downloads/pub/papers/ilk0311.pdf">PDF</a>]
<li>C. Lee Giles, Steve Lawrence and Ah Chung Tsoi, "Noisy Time Series
Prediction Using Recurrent Neural Networks and Grammatical Inference,"
<cite>Machine Learning</cite> <strong>44</strong> (2001): 161--183
<li>James Henderson, Ivan Titov, "Incremental Sigmoid Belief Networks for Grammar Learning", <a href="http://jmlr.csail.mit.edu/papers/v11/henderson10a.html"><cite>Journal of Machine Learning Research</cite> <strong>11</strong> (2010): 3541--3570</a>
<li>Mark Johnson, Thomas L. Griffiths and Sharon Goldwater, "Adaptor
Grammars: A Framework for Specifying Compositional Nonparametric Bayesian
Models", <cite>NIPS</cite> 19 [But containing significant typos; see
<a
href="http://www.cog.brown.edu/~mj/papers/JohnsonGriffithsGoldwater06AdaptorGrammars.pdf">version
at Johnson's website</a>]
<li>Bill Keller and Rudi Lutz, "Evolutionary induction of stochastic context free grammars", <a href="http://dx.doi.org/10.1016/j.patcog.2004.03.022"><cite>Pattern Recognition</cite> <strong>38</strong> (2005): 1393--1406</a>
<li>Dan Klein and Christopher D. Manning, "Natural language grammar
induction with a generative constituent-context model", <a
href="http://dx.doi.org/10.1016/j.patcog.2004.03.023"><cite>Pattern
Recognition</cite> <strong>38</strong> (2005): 1407--1419</a> ["We present a
generative probabilistic model for the unsupervised learning of hierarchical
natural language syntactic structure. Unlike most previous work, we do not
learn a context-free grammar, but rather induce a distributional model of
constituents which explicitly relates constituent yields and their linear
contexts.... [Gets the] best published unsupervised parsing results on the ATIS
corpus...."]
<li><a href="http://www.cs.cmu.edu/~lkontor/">Leo Kontorovich</a>, John
Lafferty and David Blei, "Variational Inference and Learning for a Unified
Model of Syntax, Semantics and Morphology"
[<a
href="http://reports-archive.adm.cs.cmu.edu/anon/cald/abstracts/06-100.html">Abstract,
PDF</a>]
<li>Steffen Lange and Thomas Zeugmann, "Incremental Learning from
Positive Data," <cite>Journal of Computer and System Sciences</cite>
<strong>53</strong> (1996): 88--103
<li>S. M. Lucas and T. J. Reynolds, "Learning Deterministic Finite
Automata with a Smart State Labeling Evolutionary Algorithm", <a
href="http://dx.doi.org/10.1109/TPAMI.2005.143"><cite>IEEE Transactions on
Pattern Analysis and Machine Intelligence</cite>
<strong>27</strong> (2005): 1063--1074</a>
<li>Marcelo A. Montemurro and Pedro A. Pury, "Long-range fractal
correlations in literary corpora," <a
href="http://arxiv.org/abs/cond-mat/0201139">cond-mat/0201139</a> [The paper
doesn't consider grammars, but it's an effect which grammatical inference needs
to be able to handle]
<li>Katsuhiko Nakamura and Masashi Matsumoto, "Incremental learning of
context free grammars based on bottom-up parsing and search", <a
href="http://dx.doi.org/10.1016/j.patcog.2005.01.004"><cite>Pattern
Recognition</cite> <strong>38</strong> (2005): 1384--1392</a>
<li>Partha Niyogi
<ul>
<li><cite>The Informational Complexity of Learning:
Perspectives on Neural Networks and Generative Grammars</cite> [How many licks
<em>does</em> it take to get to the core of a context-free grammar, Uncle
Noam?]
<li><cite>The Computational Nature of Language Learning and Evolution</cite> [<a href="http://mitpress.mit.edu/0-262-14094-2">Blurb</a>]
</ul>
<li>Arlindo L. Oliveira and Joao P. M. Silva, "Efficient Algorithms for
the Inference of Minimum Size DFAs," <cite>Machine Learning</cite>
<strong>44</strong> (2001): 93--119
<li>David Pico and Francisco Casacuberta, "Some Statistical-Estimation
Methods for Stochastic Finite-State Transducers," <cite>Machine
Learning</cite> <strong>44</strong> (2001): 121--141
<li>Paul Prasse, Christoph Sawade, Niels Landwehr, Tobias Scheffer,
"Learning to Identify Regular Expressions that Describe Email Campaigns", <a href="http://arxiv.org/abs/1206.4637">arxiv:1206.4637</a>
<li>Detlef Prescher, "A Tutorial on the Expectation-Maximization
Algorithm Including Maximum-Likelihood Estimation and EM Training of
Probabilistic Context-Free Grammars", <a
href="http://arxiv.org/abs/cs.CL/0412015">cs.CL/0412015</a>
<li>Juan Ramón Rico-Juan, Jorge Calera-Rubio and Rafael
C. Carrasco, "Smoothing and compression with stochastic k-testable tree
languages", <a
href="http://dx.doi.org/10.1016/j.patcog.2004.03.024"><cite>Pattern
Recognition</cite> <strong>38</strong> (2005): 1420--1430</a>
<li>Peter Rossmanith and Thomas Zeugmann, "Stochastic Finite Learning
of the Pattern Languages," <cite>Machine Learning</cite> <strong>44</strong>
(2001): 67--91
<li>Yasubumi Sakakibara
<ul>
<li>"Grammatical Inference in Bioinformatics", <a
href="http://dx.doi.org/10.1109/TPAMI.2005.140"><cite>IEEE Transactions on
Pattern Analysis and Machine Intelligence</cite> <strong>27</strong> (2005):
1051--1062</a>
<li>"Learning context-free grammars using tabular
representations", <a
href="http://dx.doi.org/10.1016/j.patcog.2004.03.021"><cite>Pattern
Recognition</cite> <strong>38</strong> (2005): 1372--1383</a> ["By employing
this representation... the problem of learning context-free grammars from
[positive and negative] examples can be reduced to the problem of partitioning
the set of nonterminals. We use genetic algorithms for solving this
partitioning problem."]
</ul>
<li>Muddassar A. Sindhu, Karl Meinke, "IDS: An Incremental Learning Algorithm for Finite Automata", <a href="http://arxiv.org/abs/1206.2691">arxiv:1206.2691</a>
<li>Patrick Suppes, <cite>Language for Humans and Robots</cite>
<li>J. L. Verdu-Mas, R. C. Carrasco and J. Calera-Rubio, "Parsing with
Probabilistic Strictly Locally Testable Tree Languages", <a
href="http://dx.doi.org/10.1109/TPAMI.2005.144"><cite>IEEE Transactions on
Pattern Analysis and Machine Intelligence</cite> <strong>27</strong> (2005):
1040--1050</a>
<li>Sicco Verwer, Mathijs de Weerdt and Cees Witteveen, "Efficiently identifying deterministic real-time automata from labeled data", <a href="http://dx.doi.org/10.1007/s10994-011-5265-4"><cite>Machine Learning</cite> <strong>86</strong> (2012): 295--333</a>
</ul>