
## Directed Information and Transfer Entropy

13 Aug 2021 20:54

These are closely related concepts, so much so that they're easily run together. (One reason I wrote this notebook was to force myself to straighten out the definitions in my head.)

Start with two random sequences $X_t$ and $Y_t$. The directed information from $X_{1:n}$ to $Y_{1:n}$ is the sum, over time, of how much information $X_{1:t}$ gives about $Y_t$ conditional on the history $Y_{1:t-1}$, $\begin{eqnarray} I[X_{1:n} \rightarrow Y_{1:n}] & = & \sum_{t=1}^{n}{I[X_{1:t}; Y_{t}|Y_{1:t-1}]}\\ & = & \sum_{t=1}^{n}{H[Y_{t}|Y_{1:t-1}] - H[Y_{t}|X_{1:t}, Y_{1:t-1}]} \end{eqnarray}$ writing $I[]$ for (conditional) mutual information and $H[]$ for (conditional) entropy, as usual.
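To make the definition concrete, here is a minimal sketch of a plug-in estimator for discrete (say, binary) sequences: it builds empirical counts from many sample paths of the joint process and sums the estimated conditional informations over time. The function names are mine, not from any standard library, and plug-in estimates like this are badly biased once the conditioning histories get long, so treat it as illustrative for small $n$ only.

```python
from collections import Counter
from math import log2

def cond_mutual_info(triples):
    """Plug-in estimate of I[A; B | C] from an iterable of (a, b, c) samples."""
    pabc = Counter(triples)
    pac = Counter((a, c) for a, b, c in pabc.elements())
    pbc = Counter((b, c) for a, b, c in pabc.elements())
    pc = Counter(c for a, b, c in pabc.elements())
    n = sum(pabc.values())
    # I[A;B|C] = sum p(a,b,c) log [ p(a,b,c) p(c) / (p(a,c) p(b,c)) ];
    # the 1/n factors cancel inside the log.
    return sum((k / n) * log2(k * pc[c] / (pac[(a, c)] * pbc[(b, c)]))
               for (a, b, c), k in pabc.items())

def directed_information(paths, n):
    """Plug-in estimate of I[X_{1:n} -> Y_{1:n}].

    paths: a list of (x, y) pairs, each a length-n sequence,
    treated as independent sample paths of the joint process."""
    total = 0.0
    for t in range(1, n + 1):
        # Summand: I[X_{1:t}; Y_t | Y_{1:t-1}], estimated across paths
        triples = [(tuple(x[:t]), y[t - 1], tuple(y[:t - 1]))
                   for x, y in paths]
        total += cond_mutual_info(triples)
    return total
```

Note that the conditioning set changes at every $t$, so this estimator needs many sample paths rather than one long one; that is a direct reflection of the definition, not just an implementation choice.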

This definition comes from Massey (1990), eq. 7, and he's usually given credit for it. This is right, so far as I can tell, but in the text of his paper he calls it a "slight modification of [a definition] introduced by Marko ... seventeen years ago". Having tracked down Marko (1973) in the course of writing this notebook, it appears that what Massey had in mind is Marko's "directed transinformation", eq. 9 (and the parallel eq. 12 for the "transinformation" in the other direction). In the present notation this would read $\lim_{t\rightarrow\infty}{H[Y_{t}|Y_{1:t-1}] - H[Y_{t}|Y_{1:t-1}, X_{1:t-1}]} = \lim_{t\rightarrow\infty}{I[Y_{t};X_{1:t-1}|Y_{1:t-1}]}$ This is almost the same as the summands in Massey's definition, but Marko was taking the limit as $t\rightarrow\infty$, and looking at the information about $Y_t$ in strictly earlier random variables, whereas Massey considers also the extra information $X_t$ gives about $Y_t$ (in the context of all the earlier $X$s and $Y$s).

By contrast, the transfer entropy (after Schreiber, eq. 4) with lag $l$ is $T(X \rightarrow Y) = H[Y_{t}|Y_{t-l:t-1}] - H[Y_{t}|X_{t-l:t-1}, Y_{t-l:t-1}] = I[Y_t; X_{t-l:t-1}|Y_{t-l:t-1}]$ The biggest difference between this expression and directed information is that directed information is a sum of transfer-entropy-like terms over time. The summands are non-negative, so the directed information grows with $n$, even for a stationary process. The definition of transfer entropy, on the other hand, has a finite order or lag $l$ built in; if we hold that fixed, the transfer entropy is constant for a stationary process, while the summands in directed information correspond to letting the lag grow over time. There is a final, subtler difference: whether we look at the information about $Y_t$ in strictly earlier $X$s, i.e., $X_{t-l:t-1}$ (transfer entropy), or in $X_{1:t}$, including the current $X_t$ (directed information). In this respect the transfer entropy looks much more like Marko's directed transinformation. In fact, if the process is a Markov chain of order $\leq l$, then Marko's directed transinformation will coincide with Schreiber's lag-$l$ transfer entropy.
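For comparison, here is a corresponding sketch of the lag-$l$ plug-in transfer entropy, reusing `cond_mutual_info` from the sketch above. Because the lag is fixed, we can pool counts over $t$ along a single long realization, which implicitly assumes stationarity (and, for the estimate to mean much, ergodicity).

```python
def transfer_entropy(x, y, l):
    """Plug-in estimate of the lag-l transfer entropy T(X -> Y),
    i.e. I[Y_t; X_{t-l:t-1} | Y_{t-l:t-1}], pooling counts over t
    along a single realization (x, y) of the joint process."""
    triples = [(y[t], tuple(x[t - l:t]), tuple(y[t - l:t]))
               for t in range(l, len(y))]
    return cond_mutual_info(triples)
```

The two sketches make the contrast above vivid: `directed_information` sums a changing quantity over $t$ and needs an ensemble of paths, while `transfer_entropy` is a single conditional information that one long stationary path can estimate.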

(Before moving on, I cannot help protesting that the "transfer entropy" is not an entropy but an information.)

These concepts, or something in their neighborhood, are basically what Granger should have meant when he defined what's come to be called "Granger causality"; defining that in terms of linear predictors confused the idea with what you could practically calculate in the 1960s. (Of course Granger was a much better statistician than I will ever be.) My impulse is to not find any of these notions especially compelling formalizations of causality, because they seem to ignore all the issues about confounding and inference-vs-influence that are really essential, though I get that they're trying.

However, Maxim Raginsky's paper on connections between directed information (specifically) and graphical causal models shakes my confidence in this last opinion. In fact, if I were smarter, I would put a lot of effort into mastering Raginsky's paper and seeing how far all of causal inference can be reformulated information-theoretically. In particular, it should be possible to say something enlightening about "when would controlling for an imperfect proxy help?" and "when does the Wang-Blei deconfounder work anyway?"

(Thanks to Giles Harper-Donnelly for helpfully pointing out some LaTeX typos.)