Directed Information and Transfer Entropy

31 Oct 2023 20:43

These are closely related concepts, so much so that they're easily run together. (One reason I wrote this notebook was to force myself to straighten out the definitions in my head.)

Start with two random sequences \( X_t \) and \( Y_t \). The directed information from \( X_{1:n} \) to \( Y_{1:n} \) is the sum, over time, of how much information \( X_{1:t} \) gives about \( Y_t \) conditional on the history \( Y_{1:t-1} \), \[ \begin{eqnarray} I[X_{1:n} \rightarrow Y_{1:n}] & = & \sum_{t=1}^{n}{I[X_{1:t}; Y_{t}|Y_{1:t-1}]}\\ & = & \sum_{t=1}^{n}{H[Y_{t}|Y_{1:t-1}] - H[Y_{t}|X_{1:t}, Y_{1:t-1}]} \end{eqnarray} \] writing \( I[] \) for (conditional) mutual information mutual information, and \( H[] \) for (conditional) entropy as usual.

This definition comes from Massey (1990), eq. 7, and he's usually given credit for it. This is right, so far as I can tell, but in the text of his paper he calls it a "slight modification of [a definition] introduce by Marko ... seventeen years ago". Having tracked down Marko (1973) in the course of writing this notebook, it appears that what Massey had in mind is Marko's "directed transinformation", eq. 9 (and the parallel eq. 12 for the "transinformation" in the other direction). In the present notation this would read \[ \lim_{t\rightarrow\infty}{H[Y_{t}|Y_{1:t-1}] - H[Y_{t}|Y_{1:t-1}, X_{1:t-1}]} = \lim_{t\rightarrow\infty}{I[Y_{t};X_{1:t-1}|Y_{1:t-1}]} \] This is almost the same as the summands in Massey's definition, but Marko was taking the limit as \( n\rightarrow\infty \), and looking at the information in about \( Y_t \) in strictly earlier random variables, whereas Massey considers also the extra information \( X_t \) gives about \( Y_t \) (in the context of all the earlier \( X \)s and \( Y \)s).

By contrast, the transfer entropy (after Schreiber [2000], eq. 4) with lag \( l \) is \[ T(X \rightarrow Y) = H[Y_{t}|Y_{t-l:t-1}] - H[Y_{t}|X_{t-l:t-1}, X_{t-l:t-1}] = I(Y_t; X_{t-l:t-1}| Y_{t-l:t-1}) \] The biggest difference between this expression and directed information is that directed information is a sum of transfer-entropy-like terms over time. The summands are non-negative, so the directed information grows with \( n \), even for a stationary process. On the other hand, the definition of transfer entropy has a finite order or lag \( l \) built in; if we hold that fixed it'll be constant for a stationary process, but the summands in directed information correspond to letting the lag grow over time. There is a final subtle difference, which is whether we're looking at conditional information on \( Y_t \) in \( X_{1:t-1} \) (transfer entropy) or in \( X_{1:t} \) (directed information). In this respect, though, the transfer entropy looks much more like Marko's directed transinformation. In fact, if the process is a Markov chain of order \( \leq l \), then Marko's directed transinformation will coincide with Schreiber's lag-\( l \) transfer entropy.

(Before moving on, I cannot help protesting that the "transfer entropy" is not an entropy but an information.)

These concepts, or something in their neighborhood, are basically what Granger should have meant when he defined what's come to be called "Granger causality"; defining that in terms of linear predictors confused the idea with what you could practically calculate in the 1960s. (Of course Granger was a much better statistician than I will ever be.) My impulse is to not find any of these notions especially compelling formalizations of causality, because they seem to ignore all the issues about confounding and inference-vs-influence that are really essential, though I get that they're trying.

However, Maxim Raginsky's paper on connections between directed information (specifically) and graphical causal models shakes my confidence in this last opinion. In fact, if I were smarter, I would put a lot of effort into mastering Raginsky's paper and seeing how far all of causal inference can be reformulated information-theoretically. In particular, it should be possible to say something enlightening about "when would controlling for an imperfect proxy help?" and "when does the Wang-Blei deconfounder work anyway?"

(Thanks to Giles Harper-Donnelly for helpfully point out some LaTeX typos.)