Minimal Advice to Undergrads on Programming
I seem to be mostly teaching classes with a big computational component.
After being hit over the head a few times by the very, very wide range of
programming skill among the students, I decided to write out some advice on how
to program, with a bit of special reference to R. This is not advice on
how to become a brilliant programmer, because I can't give such advice; I am at
best adequate for a scientist. But that's all I ask.
Corrections and suggestions are appreciated.
Update, 28 December: What follows is an updated version, incorporating the
useful suggestions of Geet Duggal, Derek M. Jones, Thomas Lumley and Chris
Wiggins. The original, if anyone cares, is archived here.
In roughly decreasing order of importance:
- Take a real programming class
- Learning enough syntax for some language to make things run without
crashing is not the same as actually learning how to think computationally.
One of the most valuable classes I ever took was CS 60A at Berkeley, which was
an introduction to programming, and so to a whole way of thinking.
(The textbook was The Structure
and Interpretation of Computer Programs.) If at all possible, take
a real programming class; if not possible, try to read a real programming
book.
- Of course by the time you are taking my class it is generally too late to
follow this advice; hence the rest of the list.
- (Actual software engineering is another discipline, over and above basic
computational thinking; that's why we have
a software engineering institute. There
is a big difference between the kind of programming I am expecting you to do,
and the kind of programming that software engineers can do.)
- Comment your code
- Comments lengthen your file, but they make it immensely easier for other
people to understand. ("Other people" includes your future self; there are few
experiences more frustrating than coming back to a program after a break only
to wonder what you were thinking.) Comments should say what each part
of the code does, and how it does it. The "what" is more important;
you can change the "how" more often and more easily.
- Every function (or subroutine, etc.) should have comments at the beginning
saying:
- what it does;
- what all its inputs are (in order);
- what it requires of the inputs and the state of the system ("presumes");
- what side-effects it may have (e.g., "plots histogram of residuals");
- what all its outputs are (in order).
Listing what other functions or routines the function calls ("dependencies") is
optional; this can be useful, but it's easy to let it get out of date.
- You should treat "Thou shalt comment thy code" as a commandment which Moses
brought down from Mt. Sinai, written on stone by a fiery Hand. I will
treat it so when I grade you.
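Here is a minimal sketch of what such a header might look like. The function rmse and its argument names are made up for illustration; any function you write would get the same treatment.

```r
# Compute the root-mean-square error of a set of predictions.
# Inputs: predicted, numeric vector of predicted values;
#         observed, numeric vector of observed values
# Presumes: both vectors have the same length and contain no NAs
# Side-effects: none
# Output: a single non-negative number, the RMSE
rmse <- function(predicted, observed) {
  # Enforce the "presumes" line rather than silently recycling values
  stopifnot(length(predicted) == length(observed))
  sqrt(mean((predicted - observed)^2))
}
```

Note that the header says what the function does and what it needs, while the one comment inside the body explains a choice about how.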
- RTFM
- If a function isn't doing what you think it should be doing, read the
manual. R in particular is pretty thoroughly documented. (I say this as
someone whose job used to involve programming a piece of special-purpose
hardware in a largely undocumented non-standard dialect of Forth.)
Look at (and try) the examples. Follow the cross-references. There are lots
of utility functions built into R; familiarize yourself with them.
- The utility functions I keep using: apply and its variants;
sort, order; aggregate; table;
rbind and cbind; paste.
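To give a feel for a few of these, here is a small sketch on a made-up data frame. The data and all the variable names are invented for illustration; the manual pages (e.g., ?apply, ?aggregate) remain the real reference.

```r
# A small made-up data frame, purely for illustration
scores <- data.frame(student = c("ann", "bob", "cy", "dee"),
                     section = c("A", "B", "A", "B"),
                     hw1 = c(90, 70, 80, 60),
                     hw2 = c(85, 75, 100, 65))

# apply: mean homework score for each row (MARGIN = 1 means "by row")
hw_means <- apply(scores[, c("hw1", "hw2")], 1, mean)

# order: row indices which sort students from highest to lowest mean
ranking <- order(hw_means, decreasing = TRUE)

# table: how many students are in each section
section_counts <- table(scores$section)

# aggregate: average hw1 score within each section
section_hw1 <- aggregate(hw1 ~ section, data = scores, FUN = mean)

# cbind: attach the new column; paste: build labels like "ann (A)"
scores <- cbind(scores, hw_mean = hw_means)
student_labels <- paste(scores$student, " (", scores$section, ")", sep = "")
```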
- Start from the beginning and break it down
- Start by thinking about what you want your program to do. Then
figure out a set of slightly smaller steps which, put together, would
accomplish that. Then take each of those steps and break it down
into yet smaller ones. Keep going until the pieces you're left with are so
small that you can see how to do each of them with only a few lines of code.
Then write the code for the smallest bits and check it; once it works, write
the code for the next larger bits, and so on.
- In slogan form:
- Think before you write.
- What first, then how.
- Design from the top down, code from the bottom up.
- (Not everyone likes to design code this way, and it's not in the
written-in-stone-atop-Sinai category, but there are many much worse ways to
start.)
- Break your code into many short, meaningful functions
- Since you have broken your programming problem into many small pieces, try
to make each piece a short function. (In other languages you might make them
subroutines or methods, but in R they should be functions.)
- Each function should achieve a single coherent task — its function,
if you will. The division of code into functions should respect this division
of the problem into sub-problems. More exactly, the way you break your
code into functions is how you have divided your problem.
- Each function should be short, generally less than a page of print-out.
The function should do one single meaningful thing. (Do not just break the
calculation into arbitrary thirty-line chunks and call each one a function.)
These functions should generally be separate, not nested one inside
the other.
- Using functions has many advantages:
- you can re-use the same code many times, either at different places in this program or in other programs;
- the rest of your code only has to care about the inputs and outputs to the
function (its interfaces), not about the internal machinery
that turns inputs into outputs. This makes it easier to design the rest of the
program, and it means you can change that machinery without having to re-design
the rest of the program.
- it makes your code easier to test (see below), to debug, and to understand.
- Of course, every function should be commented, as described above.
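As a rough sketch of how a problem divided into sub-problems turns into short functions, suppose the task is "report which observations are outliers". The task, the function names and the threshold of 2 standard deviations are all assumptions made up for this example.

```r
# Top-level task: report which observations are outliers.
# Broken down into: (1) standardize the data, (2) flag large values,
# (3) return the positions of the flagged values.

# Standardize a numeric vector to mean 0 and standard deviation 1.
# Input: x, numeric vector with at least two distinct values
# Output: numeric vector of the same length as x
standardize <- function(x) {
  (x - mean(x)) / sd(x)
}

# Flag entries whose absolute value exceeds a threshold.
# Inputs: z, numeric vector; threshold, a single positive number
# Output: logical vector, TRUE where |z| > threshold
is_extreme <- function(z, threshold) {
  abs(z) > threshold
}

# Find the positions of outliers in a data vector.
# Inputs: x, numeric vector; threshold, cutoff in standard deviations
# Output: integer vector of indices of outlying entries (possibly empty)
find_outliers <- function(x, threshold = 2) {
  which(is_extreme(standardize(x), threshold))
}
```

Each piece can be checked on its own, and changing how outliers are defined means touching only is_extreme.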
- Never do the same thing twice
- Many programs involve doing the same thing multiple times, either as
iteration, or to slightly different pieces of data, or with some parameters
adjusted, etc. Never write two pieces of code to do the same job.
Never copy the same piece of code into two places in your program.
Instead, write one piece of code (generally a function; see above)
and call it twice.
- Doing this means that there is only one place to make a mistake, rather
than many. It also means that when you fix your mistake, you only have one
piece of code to correct, rather than many. (Even if you don't make a mistake,
you can always make improvements, and then there's only one piece of code you
have to work on.) It also leads to shorter, more comprehensible and more
adaptable code.
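For instance, if two variables both need rescaling to the unit interval, one might write the rescaling once and call it twice. This is only a sketch; the function name rescale01 and the data are made up.

```r
# Rescale a numeric vector linearly onto the interval [0, 1].
# Input: x, numeric vector with at least two distinct values
# Output: numeric vector of the same length, with minimum 0 and maximum 1
rescale01 <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

heights <- c(150, 160, 170, 180)   # made-up data
weights <- c(50, 65, 80)           # made-up data

# One definition, called twice, instead of two pasted copies of the formula
heights01 <- rescale01(heights)
weights01 <- rescale01(weights)
```

If the rescaling later needs to handle missing values, there is exactly one line to change.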
- Use meaningful names
- Unlike some older languages, R lets you give variables and functions names
of essentially arbitrary length and form. So give them meaningful names.
Writing loglikelihood, or even loglike, instead of L
makes your code a little longer, but generally a lot clearer, and it runs just
the same.
- This rule is lower down in the list because there are exceptions and
qualifications. If your code is tightly associated to a mathematical paper, or
to a field where certain symbols are conventionally bound to certain variables,
you may as well use those names (e.g., call the probability of success in a
binomial p). You should, however, explain what those symbols are in
your comments. In fact, since what you regard as a meaningful name may be
obscure to others (e.g., me, when I am grading your work), you should use
comments to explain variables in any case. Finally, it's OK to use
single-letter variable names for counters in loops (but see the advice on
iteration below).
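A small sketch of how these rules might combine: a descriptive function name, the conventional symbol p for the binomial success probability, and comments which say what each name means. The argument names successes and trials are my own choices for illustration.

```r
# Log-likelihood of a binomial model.
# Inputs: p, probability of success on each trial (the conventional
#           symbol), strictly between 0 and 1;
#         successes, number of successes observed;
#         trials, total number of trials
# Output: the log-likelihood, a single number
loglikelihood <- function(p, successes, trials) {
  dbinom(successes, size = trials, prob = p, log = TRUE)
}
```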
- Check whether your program works
- It's not enough (in fact it's very little) to have a program
which runs and gives you some output. It needs to be the right
output. You should therefore construct tests, which are
things that the correct program should be able to do, but an incorrect
program should not. This means that:
- you need to be able to check whether the output is right;
- you should program the test, so it checks whether the output is right (and you can easily repeat the test as many times as you need);
- your tests should be reasonably severe, so that it's hard for an incorrect program to pass them;
- your tests should help you figure out what isn't working.
- Try to write tests for the component functions, as well as the program as a
whole. That way you can see where failures are. Also, it's easier to figure
out what the right answers should be for small parts of the problem than the
whole.
- Try to write tests as very small functions which call the component you're
testing with controlled input values. For instance, a test for a function
which supposedly calculates derivatives might check whether it gets the
derivative of x^2 or 7e^(-5x)
right at ten randomly-chosen points. The testing function should warn you if
the computed derivatives differ by more than a tolerance you specify from the
actual derivatives. (That's why you're using such simple functions.)
- With statistical procedures, tests can look at average or distributional
results. For example, I once wrote a program to estimate some parameters by
maximum likelihood; I could then use the fact that a likelihood ratio test
should have a chi-squared distribution to check that the estimation
part was working properly.
- Of course, unless you are very clever, or the problem
is very simple, a program could pass all your tests and still be
wrong, but a program which fails your tests is definitely not right.
- (Some people would actually advise writing your tests before
writing any actual functions. They have their reasons but I think that's
overkill for my courses.)
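The derivative test described above might be sketched as follows. The function under test, numerical_derivative (a central-difference approximation), is a stand-in I made up, as are the step size, the tolerance and the choice to draw test points from the interval [0, 1].

```r
# Approximate the derivative of f by central differences.
# (A hypothetical function under test, not a recommended method.)
# Inputs: f, a vectorized function of one numeric argument;
#         x, numeric vector of points; h, step size
# Output: numeric vector of approximate derivatives, one per entry of x
numerical_derivative <- function(f, x, h = 1e-5) {
  (f(x + h) - f(x - h)) / (2 * h)
}

# Test a derivative-calculating function against a known derivative.
# Inputs: deriv_fn, the function under test, called as deriv_fn(f, x);
#         f, the function to differentiate;
#         true_deriv, the exact derivative of f;
#         tolerance, largest acceptable absolute error;
#         n_points, number of random test points in [0, 1]
# Side-effects: draws random numbers; warns if the test fails
# Output: TRUE if every error is within tolerance, FALSE otherwise
test_derivative <- function(deriv_fn, f, true_deriv,
                            tolerance = 1e-6, n_points = 10) {
  x <- runif(n_points, min = 0, max = 1)
  errors <- abs(deriv_fn(f, x) - true_deriv(x))
  passed <- all(errors <= tolerance)
  if (!passed) {
    warning("largest error ", max(errors), " exceeds tolerance ", tolerance)
  }
  passed
}

# The two simple functions from the text, whose derivatives we know exactly
square_ok <- test_derivative(numerical_derivative,
                             function(x) x^2,
                             function(x) 2 * x)
exponential_ok <- test_derivative(numerical_derivative,
                                  function(x) 7 * exp(-5 * x),
                                  function(x) -35 * exp(-5 * x))
```

Because the warning reports the largest error, a failure says something about how badly things went wrong, not just that they did.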
- Don't give up; complain!
- Sometimes you may be convinced that I have given you an impossible
programming assignment, or may not be able to get some of the class code to
work properly, etc. In these cases, do not just turn in nothing saying "I
couldn't get the data file to load". Let me know. Most likely,
either there is a trick which I forgot to mention, or I made a mistake in
writing out the assignment. Either way, you are much better off telling me and
getting help than you are turning in nothing.
- When complaining, tell me what you tried, what you expected it to do, and
what actually happened. The more specific you can make this, the better. If
possible, attach the relevant R session log and workspace to your e-mail.
- Of course, this presumes that you start the homework earlier than
the night before it's due.
- Avoid iteration
- This one is very much specific to R. Explicit iteration in R is
slow. (We could talk about the reasons for that sometime if you're
interested.) In many languages, this would be a reasonable way of summing two
vectors:
for (i in 1:length(a)) {
c[i] = a[i] + b[i]
}
In R, this is stupid. R is designed to do all this in
a single "vectorized" operation:
c = a + b
Since we need to add vectors all the time, this is an
instance of using a single function repeatedly, rather than writing the same
loop many times. (R just happens to call the function "+".) It is
also orders of magnitude faster than the explicit loop, if the vectors are at
all long.
- Try to think about vectors as vectors, and, when you need to do something
to them, manipulate all their elements at once, in parallel. R is designed to
let you do this (especially through the apply function and its
relatives), and the advantage of getting to write a+b, instead of the
loop, is that it is shorter, harder to get wrong, and emphasizes the logic
(adding vectors) over the implementation. (Sometimes this won't speed things
up much, but even then it has advantages in clarity.)
- I emphasize again, however, that the speed issue is highly specific to R,
and the way it handles iteration. A good programming class (see above) will
explain the virtues of iteration, and how to translate iteration into recursion
and vice-versa.
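A small sketch comparing the two styles, with made-up vectors. The loop here allocates its result in advance so that it runs on its own; the sapply call at the end illustrates one of the relatives of apply, for a case with no built-in vectorized operator.

```r
a <- c(1, 2, 3, 4)
b <- c(10, 20, 30, 40)

# Explicit loop, with the result vector allocated in advance
loop_sum <- numeric(length(a))
for (i in seq_along(a)) {
  loop_sum[i] <- a[i] + b[i]
}

# Vectorized version: one operation on whole vectors
vector_sum <- a + b

# Both approaches give the same answer
same_answer <- identical(loop_sum, vector_sum)

# sapply applies a function to each element of a list and collects the
# results, without an explicit loop: here, the mean of each sample
samples <- list(c(1, 2, 3), c(10, 20), c(5))
sample_means <- sapply(samples, mean)
```

To see the speed difference for yourself, try wrapping each version in system.time() with vectors of a million or so elements.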
Corrupting the Young
Posted at December 19, 2008 20:45 | permanent link