## Directed Information and Transfer Entropy

*13 Aug 2021 20:54*

Yet Another Inadequate Placeholder

These are closely related concepts, so much so that they're easily run together. (One reason I wrote this notebook was to force myself to straighten out the definitions in my head.)

Start with two random sequences
\( X_t \) and \( Y_t \). The **directed information** from
\( X_{1:n} \) to \( Y_{1:n} \) is the sum, over time, of how much information
\( X_{1:t} \) gives about \( Y_t \) conditional on the history \( Y_{1:t-1} \),
\[
\begin{eqnarray}
I[X_{1:n} \rightarrow Y_{1:n}] & = & \sum_{t=1}^{n}{I[X_{1:t}; Y_{t}|Y_{1:t-1}]}\\
& = & \sum_{t=1}^{n}{H[Y_{t}|Y_{1:t-1}] - H[Y_{t}|X_{1:t}, Y_{1:t-1}]}
\end{eqnarray}
\]
writing \( I[] \) for (conditional) mutual information mutual information,
and \( H[] \) for (conditional) entropy as usual.

This definition comes from Massey (1990), eq. 7, and he's usually given
credit for it. This is right, so far as I can tell, but in the text of his
paper he calls it a "slight modification of [a definition] introduce by Marko
... seventeen years ago". Having tracked down Marko (1973) in the course of
writing this notebook, it appears that what Massey had in mind is Marko's
"directed transinformation", eq. 9 (and the parallel eq. 12 for the
"transinformation" in the other direction). In the present notation this
would read
\[
\lim_{t\rightarrow\infty}{H[Y_{t}|Y_{1:t-1}] - H[Y_{t}|Y_{1:t-1}, X_{1:t-1}]} = \lim_{t\rightarrow\infty}{I[Y_{t};X_{1:t-1}|Y_{1:t-1}]}
\]
This is *almost* the same as the summands in Massey's definition, but
Marko was taking the limit as $n\rightarrow\infty$, and looking at the
information in about \( Y_t \) in *strictly* earlier random variables,
whereas Massey considers also the extra information \( X_t \) gives about \(
Y_t \) (in the context of all the earlier \( X \)s and \( Y \)s).

By contrast, the **transfer entropy** (after Schreiber [2000], eq. 4) with lag \( l \) is
\[
T(X \rightarrow Y) = H[Y_{t}|Y_{t-1:t-l}] - H[Y_{t}|X_{t-1:t-l}, X_{t_1:t-l}] = I(Y_t; X_{t-1;t-l}| Y_{t-1:t-l})
\]
The biggest difference between this expression and directed information is that
directed information is a sum of transfer-entropy-like terms over time. The
summands are non-negative, so the directed information grows with \( n \), even
for a stationary process. On the other hand, the definition of transfer
entropy has a finite order or lag \( l \) built in; if we hold that fixed it'll
be constant for a stationary process, but the summands in directed information
correspond to letting the lag grow over time. There is a final subtle
difference, which is whether we're looking at conditional information on \( Y_t
\) in \( X_{1:t-1} \) (transfer entropy) or in \( X_{1:t} \) (directed
information). In this respect, though, the transfer entropy looks much more
like Marko's directed transinformation. In fact, if the process is a Markov
chain of order \( \leq l \), then Marko's directed transinformation will
coincide with Schreiber's lag-\( l \) transfer entropy.

(Before moving on, I cannot help protesting that the "transfer entropy" is not an entropy but an information.)

These concepts, or something in their neighborhood, are basically what
Granger *should* have meant when he defined what's come to be called
"Granger causality"; defining that in terms of linear predictors confused the
idea with what you could practically calculate in the 1960s. (Of course
Granger was a much better statistician than I will ever be.) My impulse is to
not find any of these notions especially compelling formalizations
of *causality*, because they seem to ignore all the issues about
confounding and inference-vs-influence that are really essential, though I get
that they're *trying*.

However, Maxim Raginsky's paper on connections between directed information
(specifically) and graphical causal models shakes my confidence in this last
opinion. In fact, if I were smarter, I would put a *lot* of effort into
mastering Raginsky's paper and seeing how far all of causal inference can be
reformulated information-theoretically. In particular, it should be possible
to say something enlightening about "when would controlling for an imperfect
proxy help?" and "when does the Wang-Blei deconfounder work anyway?"

(Thanks to Giles Harper-Donnelly for helpfully point out some LaTeX typos.)

*See*
Information Flow;
Information Theory;
Graphical Causal Models

- Recommended (grossly inadequate to the literature):
- Hans Marko, "The Birdirectional Communication Theory: A Generalization of Information Theory", IEEE Transactions on Communications
**21**(1973): 1345--1351 - James L. Massey, "Causality, Feedback and Directed Information", Proceedings of the 1990 International Symposium on Information Theory and Its Applictations [PDF]
- Maxim Raginsky, "Directed information and Pearl's causal calculus", pp. 958--965 in S. Meyn and B. Hajek (eds.), Proceedings of the 49th Annual Allerton Conference on Communication, Control and Computing (IEEE, 2011), arxiv:1110.0718
- Thomas Schreiber, "Measuring Information Transfer",
Physical Review Letters
**85**(2000): 461--464, arxiv:nlin/0001042

- To read:
- P. O. Amblard and O. J. J. Michel, "On directed information theory and Granger causality graphs", arxiv:1002.1446
- Terry Bossomaier, Lionel Barnett, Michael Harré, Joseph T. Lizier, An Introduction to Transfer Entropy: Information Flow in Complex Systems
- Charalambos D. Charalambous, Photios A. Stavrou, "Directed Information on Abstract Spaces: Properties and Variational Equalities", arxiv:1302.3971
- Jalal Etesami , Negar Kiyavash , Todd Coleman, "Learning Minimal Latent Directed Information Polytrees ", Neural Computation
**28**(2016): 1723--1768 - Jiantao Jiao, Haim H. Permuter, Lei Zhao, Young-Han Kim, Tsachy Weissman, "Universal Estimation of Directed Information",
IEEE Transactions on
Information Theory
**59**(2013): 6220--6242, arxiv:1201.2334 - Ioannis Kontoyiannis, Maria Skoularidou, "Estimating the Directed
Information and Testing for Causality",
IEEE Transactions on
Information Theory
**62**(2016): 6053--6067, arxiv:1507.01234 - Haim H. Permuter, Young-Han Kim and Tschay Weissman, "Interpretations of Directed Information in Portfolio Theory, Data Compression,
and Hypothesis Testing", IEEE Transactions on Information Theory
**57**(2011): 3248--3259, arxiv:0912.4872 - Christopher J. Quinn, Negar Kiyavash, Todd P. Coleman, "Directed Information Graphs", arxiv:1204.2003
- T. Weissman, Young-Han Kim and H. H. Permuter, "Directed Information, Causal Estimation, and Communication in Continuous Time", IEEE Transactions on Information Theory
**59**(2013): 1271--1287 - Bertrand Wechsler, Dan Eilat, Nicolas Limal, "Information Flow Decomposition in Feedback Systems: General Case Study", arxiv:1404.0760