December 04, 2013

Computing for Statistics (Introduction to Statistical Computing)

(My notes from this lecture are too fragmentary to post; here's the sketch.)

What should you remember from this class?

Not: my mistakes (though remember that I made them).

Not: specific packages and ways of doing things (those will change).

Not: the optimal algorithm, the best performance (human time vs. machine time).

Not even: R (that will change).

R is establishing itself as a standard for statistical computing, but you can expect to have to learn at least one new language in the course of a reasonable career of scientific programming, and probably more than one. I was taught rudimentary coding in Basic and Logo, but really only learned to program in Scheme. In the course of twenty years of scientific programming, I have had to use Fortran, C, Lisp, Forth, Expect, C++, Java, Perl and of course R, with glances at Python and OCaml over collaborators' shoulders, to say nothing of near-languages like Unix shell scripting. This was not actually hard, just tedious. Once you have learned to think like a programmer with one language, getting competent in the syntax of another is just a matter of finding adequate documentation and putting in the time to practice it --- or finding minimal documentation and putting in even more time (I'm thinking of you, CAM-Forth). It's the thinking-like-a-programmer bit that matters.

Instead, remember rules and habits of thinking:

  1. Programming is expression: take a personal, private, intuitive, irreproducible series of acts of thought, and make it public, objective, shared, explicit, repeatable, improvable. This resembles both writing and building a machine: communicative like writing, but with the impersonal, it-all-must-fit check of the machine. All the other principles follow from this fact, that it is turning an act of individual thought into a shared artifact --- reducing intelligence to intellect (cf.).
  2. Top-down design
    1. What are you trying to do, with what resources, and what criteria of success?
    2. Break the whole solution down into a few (say, 2--6) smaller and simpler steps
    3. If those steps are so simple that you can see how to do them, do them
    4. If they are not, treat each one as a separate problem and recurse
    This means that you do not have to solve a complex problem all at once, but rather can work on pieces, or share pieces of the work with others, with confidence that your incremental efforts will fit meaningfully into an integrated whole; reductionism is a strategy for making the whole work.
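The recursion above can be sketched in code; since the notes insist the language doesn't matter, here is a toy Python version, in which the task and every function name are invented purely for illustration:

```python
# Top-down design in miniature: "summarize noisy measurements" is broken
# into two smaller steps, each simple enough to write directly.
# (Hypothetical task; None stands in for a missing value.)

def summarize(measurements):
    """Top-level plan: clean the data, then describe what is left."""
    cleaned = drop_missing(measurements)
    return describe(cleaned)

def drop_missing(values):
    """Step 1: discard missing entries."""
    return [v for v in values if v is not None]

def describe(values):
    """Step 2: report the count and the mean."""
    return {"n": len(values), "mean": sum(values) / len(values)}

print(summarize([1.0, None, 2.0, 3.0]))  # {'n': 3, 'mean': 2.0}
```

Each helper is small enough to write, test, and replace on its own, which is exactly what the recursion buys you.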
  3. Modular and functional programming
    1. Use data structures to group related values together
      1. Select, or build, data structures which make it easy to get at what you want
      2. Select, or build, data structures which hide the implementation details of no relevance to you
    2. Use functions to group related operations together
      1. Do not reinvent the wheel
      2. Do unify your approach
      3. Avoid side-effects, if possible
      4. Consider using functions as inputs to other functions
    3. Take the whole-object view when you can
      1. Cleaner
      2. More extensible
      3. Sometimes faster
      4. Essential to clarity in things like split/apply/combine
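For instance, a minimal split/apply/combine, with the summary function passed in as an argument to another function; this is an invented Python sketch, not anything from the lecture:

```python
# Split records into groups, apply a function to each group, combine
# the results into one object. The names and data are illustrative.
from collections import defaultdict

def split_apply_combine(pairs, f):
    """Group (key, value) pairs by key, apply f to each group's values."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)                       # split
    return {k: f(vs) for k, vs in groups.items()}  # apply + combine

data = [("a", 1), ("b", 4), ("a", 3), ("b", 2)]
print(split_apply_combine(data, lambda vs: sum(vs) / len(vs)))
# {'a': 2.0, 'b': 3.0}
```

Because the per-group operation is a function argument, swapping the mean for a median, a count, or a fitted model changes one line, not the grouping machinery.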
  4. Code from the bottom up; code for revision
    1. Start with the smallest, easiest bits from your outline/decomposition and work your way up
    2. Document your code as you go
      1. Much easier than going back and trying to document after
      2. Comment everything
      3. Use meaningful names
      4. Use conventional names
    3. Write tests as you go
      1. Make it easy to re-run tests
      2. Keep working on your code until it passes your tests
      3. Make it easy to add tests
      4. Whenever you find a new bug, add a corresponding test
    4. Prepare to revise
      1. Once code is working, look for common or analogous operations: refactor them as general functions; conversely, consider whether one big function mightn't be split apart
      2. Look for data structures that show up together: refactor them as one object; conversely, consider whether some pieces of one data structure couldn't be split off
      3. Be willing to re-do some or all of the plan, once you know more about the problem
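The testing habit might look like this in miniature: a hypothetical function with its tests kept right next to it, including a regression test recording an old (entirely invented) bug:

```python
# Tests that are trivial to re-run, and a regression test added the
# day a bug was found. Function and bug are made up for illustration.

def midrange(values):
    """Average of the smallest and largest values."""
    return (min(values) + max(values)) / 2  # an earlier, invented version
                                            # used // and failed on odd sums

def run_tests():
    assert midrange([0, 10]) == 5
    assert midrange([3]) == 3
    # Regression test: the invented // bug surfaced on odd sums, so keep this.
    assert midrange([1, 2]) == 1.5
    print("all tests pass")

run_tests()
```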
  5. Statistical programming is different
    1. Bear in mind noise in input data
      1. Consider random test cases, or, more generally, avoid neat numerical values in tests
      2. Try to have some idea of how much precision is actually possible, and avoid asking for more (wasted effort)
    2. Use probability and simulation to make test cases
      1. What should (semi-) realistic data look like?
      2. How well does the procedure work?
    3. Use probability and simulation to approximate when exact answers are intractable
      1. Monte Carlo
      2. Stochastic optimization
      3. Replacing deterministic exact problems with relaxed probabilistic counterparts
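A minimal Monte Carlo illustration (again not from the lecture): approximate pi/4 as the probability that a uniform random point in the unit square lands inside the quarter circle:

```python
# Monte Carlo: replace an exact calculation with an average over
# random draws. Seeding makes the "random" test reproducible.
import random

def mc_quarter_circle(n, seed=0):
    """Fraction of n uniform points in [0,1]^2 with x^2 + y^2 <= 1."""
    rng = random.Random(seed)
    hits = sum(rng.random() ** 2 + rng.random() ** 2 <= 1 for _ in range(n))
    return hits / n

print(4 * mc_quarter_circle(100_000))  # roughly 3.14, within Monte Carlo error
```

Note that a sensible test of this function asks for the right answer only up to the sampling noise, which is itself calculable: the honest precision point from 5.1.2 above.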
  6. Get used to working with multiple systems
    1. Use existing packages/software (unless instructed otherwise!)
    2. Use specialized tools for specialized purposes
      1. Numerical methods code (implicit but hidden)
      2. Regular expressions for text processing: grep/sed/awk, Perl, ...
      3. Databases for massive storage and retrieval
      4. Graphics, etc.
    3. Expect to have to keep learning
      1. Learning new formalisms: gets easier as you go
      2. Learning how to translate data structures back and forth between their formalism and yours
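On specialized tools: a tiny example of reaching for regular expressions instead of hand-rolled string code (the log format here is made up):

```python
# Pull numeric fields out of semi-structured text with one pattern,
# rather than looping over characters by hand. Invented log line.
import re

log = "temp=21.5 temp=19.0 temp=23.2"
temps = [float(m) for m in re.findall(r"temp=([0-9.]+)", log)]
print(temps)  # [21.5, 19.0, 23.2]
```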
To sum up, in order of increasing importance:


Posted at December 04, 2013 10:30 | permanent link

Three-Toed Sloth