Filtering, State Estimation, and Other Forms of Signal Processing

05 Mar 2024 13:38

Etymologically, a "filter" is something you pass a fluid through, especially to purify it. In signal processing, a filter became a device or operation you pass a signal through, especially to remove what is (for your purposes) noise or irrelevancies, thereby purifying it. Its meaning (in this sense) thus diverged: one the one hand, to general transformations of signals, such as preserving their high-frequency ("high-pass") or low-frequency ("low-pass") components; on the other hand, to trying to estimating some underlying true value or state, corrupted by distortion and noise.

Perhaps the classic and most influential "filter" in the latter sense was the one proposed by Norbert Wiener in the 1940s. More specifically, Wiener considered the problem of estimating a state \( S(t) \), following a stationary stochastic processes, from observations of another, related stationary stochastic process \( X(t) \). He restricted himself to linear estimates, of the form \( \hat{S}(t) = \int_{-\infty}^{t-h}{\kappa(t-s) X(s) ds} \) (or the equivalent sum in discrete time), and provided a solution, i.e., a function \( \kappa(u) \), that minimized the expected squared error \( \mathbb{E}[(S(t) - \hat{S}(t))^2] \). Notice that if \( h \) is positive, this combines estimating the state with extrapolating it into the future, while if \( h \) is negative, one is estimating the state with a lag. Wiener's solution did not assume that the true relationship between \( S \) and \( X \) is linear, or that either process shows linear dynamics --- just that the processes are stationary, and one wanted a linear estimator.

Later, in the late 1950s, Kalman, and Kalman and Bucy, tackled the situation where \( S(t) \) follows a linear, Gaussian, discrete-time Markov process, so \( S(t+1) = a S(t) + \epsilon(t) \) for some independent Gaussian noise variables \( \epsilon \), and the observable \( X(t) \) is linearly related to \( S(t) \), \( X(t) = b X(t) + \eta(t) \). They solved for the conditional distribution of \( S(t) \) given \( X(1), \ldots X(t) \), say \( S(t)|X(1:t) \). This is again a Gaussian, whose mean and variance can be expressed in closed form given the parameters, and the mean and variance of \( S(t-1)|X(1:t-1) \). The recursive computation of this conditional distribution came to be called the Kalman filter. The Kalman smoother came to refer to the somewhat more involved computation of \( S(t)|X(1:n) \), \( n > t \) --- that is, going back and refining the estimate of the unobserved state using later observations. This seems to be the root of distinguishing two ways of estimating the states of hidden Markov models, filtering, i.e., getting \( S(t)|X(1:t) \), and smoothing, i.e., getting \( S(t)|X(1:n) \).

The Kalman filter also, unlike the Wiener filter, relied strongly on assumptions about the data-generating process, namely that it really was a linear, Gaussian, hidden-Markov or state-space process. The fragility of the latter assumptions spurred a lot of work, over many years, seeking to either repeat the same pattern under different assumptions, or to use Kalman's solution as some kind of local approximation.

Separately (so far as I can tell from reading the literature), people who were interested in discrete-state Markov chains, observed through noise, considered the same problem of estimating the state from observations. A recursive estimate of \( S(t)|X(1:t) \) came to be called the "forward algorithm". (Applying the exact same ideas to linear-Gaussian HMMs would give you the Kalman filter, though I don't know when people realized that.) The forward algorithm, in turn, served as a component in a more elaborate recursive algorithm for \( S(t)|X(1:n) \), called the "backward algorithm".

The forward algorithm is actually very pretty, and I've just taught it, so I'll sketch the derivation. Assume we've got \( \Pr(S_t|X(1:t)) \), and know \( \Pr(S(t+1)|S(t)) \) (all we need for the dynamics, since the state \( S \) is a Markov process) and \( \Pr(X(t)|S(t)) \) (all we need for the observations, since \( X \) is a hidden-Markov process). First, we extrapolate the state-estimate forward in time: \begin{eqnarray*} \Pr(S(t+1)=s|X(1:t)) & = & \sum_{r}{\Pr(S(t+1)=s, S(t)=r|X(1:t))}\\ & = & \sum_{r}{\Pr(S(t+1)=s|S(t)=r, X(1:t))\Pr(S(t)=r|X(1:t))}\\ & = & \sum_{r}{\Pr(S(t+1)=s|S(t)=r)\Pr(S(t)=r|X(1:t))} \end{eqnarray*} Next, we calculate the predictive distribution: \begin{eqnarray*} \Pr(X(t+1)=x|X(1:t)) & = & \sum_{s}{\Pr(X(t+1)=x|S(t+1)=s, X(1:t))\Pr(S(t+1)=s| X(1:t))}\\ & = & \sum_{s}{\Pr(X(t+1)=x|S(t)=s)\Pr(S(t+1)=s|X(1:t))} \end{eqnarray*} Finally, we use Bayes's rule: \begin{eqnarray*} \Pr(S(t+1)=s|X(1:t+1)) & = & \Pr(S(t+1)=s|X(1:t), X(t+1))\\ & = & \frac{\Pr(S(t+1)=s, X(t+1)|X(1:t))}{\Pr(X(t+1)|X(1:t))}\\ & = & \frac{\Pr(X(t+1)|S(t+1)=s)\Pr(S(t+1)=s|X(1:t))}{\Pr(X(t+1)|X(1:t))}\\ \end{eqnarray*} It's worth noting here that the exact same idea works if \( S(t) \) and/or \( X(t) \) are continuous rather than discrete --- just replace probabilities with probability densities, and sums with integrals as appropriate.

A purely formal solution to finding \( S(t)|X(1:t) \) in arbitrary nonlinear processes, even in continuous time, was worked out by the 1960s; it was, again, a recursion which implemented Bayes's rule. Unfortunately, with continuous states, it's pretty much intractable in general, since you'd need to maintain a probability density over possible states (and then integrate it, twice). This only got people more interested in the special cases which admitted closed forms (like the Kalman filter), or, again, to approximations based on those closed forms.

A minor revolution from the 1990s --- I forget the exact dates and I'm under-motivated to look them up --- was to realize that the exact nonlinear filter could be approximated by Monte Carlo. Look at the way I derived the forward algorithm above. Suppose we didn't know the exact distribution \( \Pr(S(t)|X(1:t)) \), but we did have a sample \( R_1, R_2, \ldots R_m \) drawn from it. We could take each of these and (independently) apply the Markov process's transition to it, to get new states, at time \( t+2 \), say \( S_1, S_2, \ldots S_m \). These values constitute a sample from \( \Pr(S(t+1)|X(1:t)) \). The model tells us \( \Pr(X(t+1)=x|S(t+1)=S_i) \) for each \( S_i \). Averaging those distributions over the samples gives us an approximation to \( \Pr(X(t+1)=x|X(1:t)) \). If we re-sample the \( S_i \)'s with probabilities proportional to \( \Pr(X(t+1)=x(t+1)|S(t+1)=S_i) \), we are left with an approximate sample from \( \Pr(S(t+1)|S(1:t+1)) \). This is the particle filter, the samples being "particles". (The mathematical sciences are not known for consistent, well-developed metaphors.)

All of this presumed (like the Kalman filter) that the parameters of the process where known, and all one needed to estimate was the particular realization of the hidden state \( S(t) \). (Even the Wiener filter presumes a knowledge of the covariance functions, though no more.) But the forward and backward algorithms can be used as components in an algorithm for maximum likelihood estimation of the parameters of an HMM, variously called the "forward-backward", "Baum-Welch" or "expectation-maximization" algorithm. (In fact, the forward algorithm gives us \( \Pr(X(t+1)=x|X(1:t)) \) for each \( t \). Multiplying these together, \( \prod_{t}{\Pr(X(t+1)=x|X(1:t))} \), clearly gives the probability that the model assigns to the whole observed trajectory \( X(1:n) \). Since this is the "likelihood" of the model (in the sense statisticians use that word), once we've done "filtering", we can use all the common likelihood-based statistical techniques to estimate the model. (Of course, whether likelihood works is another issue...)

Independent component analysis.