Via Language Log, there comes this fun post on the distribution of the number of repetitions of "R" in strings of the form "AR+", as in, "Arrr, mateys!" These findings, like Mark Liberman's on "AW+", are in line with the results of the seminal paper in this area, Dennis Chao and Patrik D'haeseleer's "The Distribution of Variable-length Phatic Interjectives on the World Wide Web" (University of New Mexico Computer Science Department Tech Report TR-CS-2001-23). I eagerly await further results in this exciting pico-field.
Being what I am, however, I can't resist pointing out that looking for a straight line on a log-log plot, and even finding one with high r-squared, is simply not a reliable way of checking whether a distribution is a power law: other heavy-tailed distributions, the log-normal for one, also look roughly straight on such a plot over a limited range. Please do not do this. (And yes, I should be finishing that paper on the right approach, rather than blogging.)
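To make the complaint concrete, here is a minimal sketch (mine, not taken from any of the posts or papers mentioned above) that fits a straight line to the log-log plot of something which is definitely not a power law: the exact upper tail probability of a log-normal distribution. The parameter choices (log-mean 0, log-standard-deviation 3, 200 log-spaced points between 1 and 1000) and the function names are illustrative assumptions, nothing more.

```python
import math

def lognormal_ccdf(x, mu=0.0, sigma=3.0):
    """P(X > x) for a log-normal with log-mean mu and log-sd sigma.

    The defaults are illustrative assumptions, not fitted to any data.
    """
    return 0.5 * math.erfc((math.log(x) - mu) / (sigma * math.sqrt(2.0)))

def loglog_fit(xs, ps):
    """Least-squares line through (log x, log p).

    Returns (slope, intercept, r_squared).
    """
    lx = [math.log(x) for x in xs]
    lp = [math.log(p) for p in ps]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_p = sum(lp) / n
    sxx = sum((a - mean_x) ** 2 for a in lx)
    spp = sum((b - mean_p) ** 2 for b in lp)
    sxp = sum((a - mean_x) * (b - mean_p) for a, b in zip(lx, lp))
    slope = sxp / sxx
    intercept = mean_p - slope * mean_x
    # For a simple linear regression, R^2 is the squared correlation.
    r_squared = sxp ** 2 / (sxx * spp)
    return slope, intercept, r_squared

# 200 log-spaced points covering three decades, from 1 to 1000.
xs = [10 ** (3 * i / 199) for i in range(200)]
ps = [lognormal_ccdf(x) for x in xs]

slope, intercept, r_squared = loglog_fit(xs, ps)
print(f"fitted 'exponent': {slope:.3f}, r-squared: {r_squared:.3f}")
```

The curve being fitted is concave on log-log axes, not straight, yet over three decades the regression still reports an r-squared well above 0.95 and hands back a perfectly plausible-looking "exponent". A high r-squared here tells you only that the curve bends gently over the range you looked at.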
Posted at September 20, 2006 13:23