Computing for Statistics (Introduction to Statistical Computing)
(My notes from this lecture are too fragmentary to post; here's the sketch.)
What should you remember from this class?
Not: my mistakes (though remember that I made them).
Not: specific packages and ways of doing things (those will change).
Not: the optimal algorithm, the best performance (human time vs. machine time).
Not even: R (that will change).
R is establishing itself as a standard for statistical computing, but you
can expect to have to learn at least one new language in the course of a
reasonable career of scientific programming, and probably more than one. I was
taught rudimentary coding in Basic and Logo, but really only learned
to program in Scheme. In
the course of twenty years of scientific programming, I have had to use
Fortran, C, Lisp, Forth, Expect, C++, Java, Perl and of course R, with glances
at Python and OCaml over collaborators'
shoulders, to say nothing of near-languages like Unix shell scripting. This
was not actually hard, just tedious. Once you have learned to think
like a programmer with one language, getting competent in the syntax of another
is just a matter of finding adequate documentation and putting in the time to
practice it --- or finding minimal documentation and putting in even more time
(I'm thinking of
you, CAM-Forth). It's
the thinking-like-a-programmer bit that matters.
Instead, remember rules and habits of thinking:
- Programming is expression: take a personal, private,
intuitive, irreproducible series of acts of thought, and make it public,
objective, shared, explicit, repeatable, improvable. This resembles both
writing and building a machine: communicative like writing, but with the
impersonal, it-all-must-fit check of the machine. All the other principles
follow from this fact, that it is turning an act of individual thought into a
shared artifact --- reducing intelligence to intellect.
- Top-down design
This is a strategy for making the whole work: you do not have to solve a complex
problem all at once, but rather can work on pieces, or share pieces of the work
with others, with confidence that your incremental efforts will fit meaningfully
into the whole.
- What are you trying to do, with what resources, and what criteria of success?
- Break the whole solution down into a few (say, 2--6) smaller and simpler steps
- If those steps are so simple you can see how to do them, do them
- If they are not, treat each one as a separate problem and recurse
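The recipe above can be sketched in code. This is a made-up example (the problem, the function names, and the data format are all invented for illustration, and it is in Python rather than R, since the point is language-independent): the top-level function states the plan in terms of steps, and each step is either simple enough to write directly or would itself be decomposed the same way.

```python
def analyze(lines):
    """Top level: parse the data, clean it, fit a line, summarize the fit."""
    data = read_data(lines)
    clean = remove_missing(data)
    fit = fit_line(clean)
    return summarize(fit)

def read_data(lines):
    # Step 1: parse one "x,y" pair per line into (float, float) tuples.
    return [tuple(float(v) for v in line.split(",")) for line in lines]

def remove_missing(data):
    # Step 2: drop pairs containing NaN (NaN is the only value != itself).
    return [(x, y) for (x, y) in data if x == x and y == y]

def fit_line(data):
    # Step 3: least-squares slope and intercept, simple enough to do directly.
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    slope = (sum((x - mx) * (y - my) for x, y in data)
             / sum((x - mx) ** 2 for x, _ in data))
    return slope, my - slope * mx

def summarize(fit):
    # Step 4: report the fit in words.
    slope, intercept = fit
    return "y = %.2f + %.2f x" % (intercept, slope)
```

Notice that the top-level function reads like the outline itself; someone else could take over `fit_line` without touching the rest.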
- Modular and functional programming
- Use data structures to group related values together
- Select, or build, data structures which make it easy to get at what you want
- Select, or build, data structures which hide the implementation details of no relevance to you
- Use functions to group related operations together
- Do not reinvent the wheel
- Do unify your approach
- Avoid side-effects, if possible
- Consider using functions as inputs to other functions
- Take the whole-object view when you can
- More extensible
- Sometimes faster
- Essential to clarity in things like split/apply/combine
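As a concrete sketch of the last few points (again in Python, with invented names and data): a single split/apply/combine function takes the grouping rule and the per-group summary as function arguments, so the same machinery serves any grouping and any summary, and the caller works with the whole collection as one object.

```python
from collections import defaultdict

def split_apply_combine(records, key, summary):
    """Group records by key(record), apply summary to each group, combine
    the results into one dictionary."""
    groups = defaultdict(list)
    for rec in records:
        groups[key(rec)].append(rec)
    return {k: summary(v) for k, v in groups.items()}

# Hypothetical data: (group label, measurement) pairs.
heights = [("a", 1.0), ("a", 3.0), ("b", 2.0)]
means = split_apply_combine(
    heights,
    key=lambda rec: rec[0],                                      # how to split
    summary=lambda recs: sum(x for _, x in recs) / len(recs),    # what to apply
)
```

Swapping `summary=len` would count group sizes instead of averaging, with no change to the grouping code: that is the payoff of functions as inputs to other functions.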
- Code from the bottom up; code for revision
- Start with the smallest, easiest bits from your outline/decomposition and work your way up
- Document your code as you go
- Much easier than going back and trying to document after
- Comment everything
- Use meaningful names
- Use conventional names
- Write tests as you go
- Make it easy to re-run tests
- Keep working on your code until it passes your tests
- Make it easy to add tests
- Whenever you find a new bug, add a corresponding test
- Prepare to revise
- Once code is working, look for common or analogous operations: refactor as general functions; conversely, look if one big function mightn't be split apart
- Look for data structures that show up together: refactor as one object; conversely, look if some pieces of one data structure couldn't be split off
- Be willing to re-do some or all of the plan, once you know more about the problem
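A minimal sketch of test-as-you-go, using an invented toy function: the tests are bare assertions kept next to the code, cheap to re-run after every change, and when a bug turns up (say, the empty-string case), an assertion recording it is added so that the bug cannot silently return.

```python
def word_count(text):
    """Number of whitespace-separated words in text."""
    return len(text.split())

# Tests, re-run after every change to word_count.
assert word_count("the quick brown fox") == 4
assert word_count("one") == 1
assert word_count("  spaced   out  ") == 2
# Regression test, added after the empty-string case was found to misbehave
# in an earlier (hypothetical) version:
assert word_count("") == 0
```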
- Statistical programming is different
- Bear in mind noise in input data
- Consider random test cases, or generally avoiding neat numerical values in tests
- Try to have some idea of how much precision is actually possible, and avoid asking for more (wasted effort)
- Use probability and simulation to make test cases
- What should (semi-) realistic data look like?
- How well does the procedure work?
- Use probability and simulation to approximate when exact answers are intractable
- Monte Carlo
- Stochastic optimization
- Replacing deterministic exact problems with relaxed probabilistic counterparts
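Both points, simulation as approximation and noise-aware testing, fit in one small sketch (an invented example in Python): Monte Carlo estimates the mean of the larger of two independent Uniform(0,1) draws, and the test compares the estimate to the exact answer 2/3 with a tolerance rather than demanding exact equality, since a noisy answer deserves a noisy test.

```python
import random

def mc_mean_max(n, seed=42):
    """Monte Carlo estimate of E[max(X, Y)] for X, Y ~ Uniform(0,1),
    independent, averaged over n simulated pairs."""
    rng = random.Random(seed)  # seeded, so the estimate is reproducible
    total = 0.0
    for _ in range(n):
        total += max(rng.random(), rng.random())
    return total / n

estimate = mc_mean_max(100_000)
# The exact value is 2/3; the Monte Carlo standard error shrinks like
# 1/sqrt(n), so at n = 100,000 a tolerance of 0.01 is very generous.
assert abs(estimate - 2/3) < 0.01
```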
- Get used to working with multiple systems
- Use existing packages/software (unless instructed otherwise!)
- Use specialized tools for specialized purposes
- Numerical methods code (implicit but hidden)
- Regular expressions for text processing: grep/sed/awk, Perl, ...
- Databases for massive storage and retrieval
- Graphics, etc.
- Expect to have to keep learning
- Learning new formalisms: gets easier as you go
- Learning how to translate their data structures back and forth into yours
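One small illustration of reaching for a specialized tool (the log-line format here is invented, and the example is in Python's `re` module rather than grep/sed/Perl, but the regular-expression notation is essentially the same across all of them): a pattern pulls numeric readings out of messy text in one line, a job that is tedious and error-prone with hand-rolled string chopping.

```python
import re

lines = [
    "site=A temp=12.5 ok",
    "site=B temp=-3.0 ok",
    "garbage line, no reading",
]
# Match "temp=" followed by an optionally signed decimal number,
# capturing the number itself.
pattern = re.compile(r"temp=(-?\d+(?:\.\d+)?)")
temps = [float(m.group(1)) for line in lines if (m := pattern.search(line))]
```

Lines with no reading are simply skipped; the same pattern would work unchanged in grep or sed, which is the point of learning the formalism once.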
To sum up, in order of increasing importance:
- Plan for randomness
- Take the whole-object view when you can
- Think in terms of functions and objects
- Practice top-down design
- Code with an eye to debugging and revision
- Remember that programming is expression
Posted at December 04, 2013 10:30 | permanent link