August 08, 2003


Dear Friend,

If you're anything like me, hardly a day goes by without someone coming up to you and handing you a long discrete-valued time series and asking you to find a hidden Markov model for it. Then there follows the endless drudgery of implementing the Baum-Welch algorithm, tuning parameters and extracting a state sequence from the Viterbi algorithm, only to decide that the Markovian structure you chose in the first place was just plain wrong, and starting over from scratch.

What would you pay, Friend, for an algorithm which let you give up all that drudgery for good? What would you pay for an algorithm which would infer an appropriate architecture from your data? How much more would you pay if it would provably converge1 on the optimal architecture?

Don't name your price yet! Suppose that with one simple click you could actually download a debugged implementation of this algorithm. Suppose you could even read the source-code. Suppose the code was not some grotty piece of physics-grad-student Fortran, but thoroughly-commented C++ written in accordance with actual software engineering principles by somebody who knows them. Now how much would you pay?

Friend, unless you answered "nothing", you'd pay too much. I present, for your consideration, CSSR, distributed for free under the Gnu Public License.

Seriously, this is the implementation of the algorithm described in my SFI working paper with Kris and Jim Crutchfield, called, imaginatively enough, "An Algorithm for Pattern Discovery in Time Series", cs.LG/0210025. I developed most of the theory, Kris saw why my original idea wouldn't work and what kind of thing would be needed to fix it, as well as doing all the actual programming, and Jim sponsored the project and kept thinking up nasty test-cases and daring us to run the program on them. I'm not sure anyone reading this weblog would actually care about the program, but I thought I'd give it a chance, and anyway this will help Google see it faster.

[1] Some restrictions apply. See cs.LG/0210025 for details.

Complexity ; Self-Centered ; Enigmas of Chance

Posted at August 08, 2003 17:09 | permanent link

Three-Toed Sloth