(This article was originally published at Three-Toed Sloth, and syndicated at StatsBlogs.)

Wikipedia is a tremendous accomplishment and an invaluable resource. It is
also *highly* unreliable. Since I have just spent a bit of time on an instance of the second point, let me record it here for posterity.

A reader of
my notebook on
information theory wanted to know whether I made a mistake there when I
said that "self-information" is, in information theory, just an alternative
name for the entropy of a random variable. After all, he said,
the Wikipedia article
on self-information (version
of 22
July 2016) says that the self-information of an *event* (not a
random variable) is the negative log probability of that event^{*}. What follows
is modified from my reply to my correspondent.

In brief: (1) my usage is the one I learned from my teachers and textbooks; (2) the Wikipedia page is the first time I have ever seen this other usage; and (3) the references given by the Wikipedia page do not actually support the usage it advocates; only one of them even uses the term "self-information", and that one supports my usage rather than the page's.

To elaborate on (3), the Wikipedia page cites as references (a)
a paper by Meila on
comparing clusterings, (b) Cover and Thomas's standard textbook, and (c)
Shannon's original paper. (a) is a good paper, but in fact never uses the
phrase "self-information" (or "self information", etc.). For (b), the
Wikipedia page cites p. 20 of the first edition from 1991, which I no longer
have; but in the second edition, "self-information" appears just once, on p. 21,
as a synonym for entropy ("This is the reason that entropy is sometimes
referred to as *self-information*"; their italics). As for (c),
"self-information" does not appear anywhere in Shannon's paper (nor, more
remarkably, does "mutual information"), and in fact Shannon gives no name to
the quantity \( -\log{p(x)} \).

There are also three external links on the page: the first
("Examples of surprisal
measures") only uses the word "surprisal". The
second, "
'Surprisal' entry in a glossary of molecular information theory", again
only uses the word "surprisal" (and that glossary has no entry for
"self-information"). The
third, "Bayesian Theory of
Surprise", does not use either word, and in fact defines "surprise" as the
KL divergence between a prior and a posterior distribution, not using
\( -\log{p(x)} \) at all. The Wikipedia page *is* right that \( -\log{p(x)} \)
is sometimes called "surprisal", though "negative log likelihood" is much more
common in statistics, and some more mathematical authors (e.g., R. M. Gray,
Entropy and Information
Theory [2nd ed., Springer, 2011], p. 176) prefer "entropy density".
But, as I said, I have never seen anyone else call it "self-information". I am
not sure where this strange usage began, but I suspect it's something some
Wikipedian just made up. The error seems to go back to the first version of
the page on self-information,
from 2004
(which cites no references or sources at all). It has survived all 136
subsequent revisions. None of those revisions, it appears, ever involved
checking whether the original claim was right, or indeed even whether the
external links and references actually supported it.

I could, of course, try to fix this myself, but it would involve replacing the page with something about one sentence long, saying "In information theory, 'self-information' is a synonym for the entropy of a random variable; it is the expected value of the 'surprisal' of a random event, but is not the same as the surprisal." Leaving aside the debate about whether a topic which can be summed up in a sentence deserves a page of its own, I am pretty certain that if I didn't waste a lot of time defending the edit, it would swiftly be reverted. I have better things to do with my time.**
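
To make that one-sentence summary concrete, here is a minimal sketch in Python (the biased coin and its probabilities are made up purely for illustration): the surprisal of an *event* differs from outcome to outcome, while the entropy of the *random variable* is the expected surprisal, a single number attached to the whole distribution.

```python
import math

# A made-up biased coin: p(heads) = 0.9, p(tails) = 0.1.
p = {"heads": 0.9, "tails": 0.1}

# Surprisal of each individual event: -log2 p(x); it differs by outcome.
surprisal = {x: -math.log2(q) for x, q in p.items()}
print(surprisal)   # heads: about 0.152 bits, tails: about 3.322 bits

# Entropy of the random variable: the expected surprisal, one number overall.
entropy = sum(q * surprisal[x] for x, q in p.items())
print(entropy)     # about 0.469 bits
```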

How many other Wikipedia pages are based on similar misunderstandings and inventions, I couldn't begin to say. Nor could I pretend to guess whether Wikipedia has more such errors than traditional encyclopedias.

*: The (Shannon) entropy of a
random variable \( X \), with probability mass function \( p(x) \), is of
course just \( H[X] \equiv -\sum_{x}{p(x) \log{p(x)}} \). The conditional
entropy of one random variable \( Y \) given a *particular* value of
another is just the entropy of the conditional distribution, \( H[Y|X=x] \equiv
-\sum_{y}{p(y|x) \log{p(y|x)}} \). The conditional entropy is the average of
this, \( H[Y|X] \equiv \sum_{x}{p(x) H[Y|X=x]} \). The information \( X \)
contains about \( Y \) is the (average) amount by which conditioning on \( X \)
reduces the entropy of \( Y \), \( I[X;Y] \equiv H[Y] - H[Y|X] \). It turns
out that this is always equal to \( H[X] - H[X|Y] = I[Y;X] \), hence "mutual
information". The term "self-information" is sometimes used by contrast for \(
H[X] \), which you can convince yourself is also equal to \( I[X;X] \).
Wikipedia, by contrast, is claiming that "self-information" refers to the
quantity \( -\log{p(x)} \), so it's a property of a particular outcome or event
\( x \), rather than of a probability distribution or random
variable.
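
Since that footnote packs several definitions into a few lines, here is a minimal Python sketch of them (the joint distribution, the numbers, and the helper `H` are my own, purely for illustration): it computes \( H[X] \), \( H[Y|X] \), and \( I[X;Y] \) exactly as defined above, and checks that \( I[X;X] = H[X] \).

```python
import math

def H(dist):
    """Shannon entropy (in bits) of a pmf given as {outcome: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# A made-up joint pmf p(x, y), purely for illustration.
joint = {("a", 0): 0.30, ("a", 1): 0.20,
         ("b", 0): 0.10, ("b", 1): 0.40}

# Marginal distributions p(x) and p(y).
px, py = {}, {}
for (x, y), p in joint.items():
    px[x] = px.get(x, 0.0) + p
    py[y] = py.get(y, 0.0) + p

# Conditional entropy H[Y|X] = sum_x p(x) H[Y|X=x].
HY_given_X = sum(
    p_x * H({y: joint[(x, y)] / p_x for (xx, y) in joint if xx == x})
    for x, p_x in px.items()
)

# Mutual information I[X;Y] = H[Y] - H[Y|X]; about 0.125 bits here.
print(H(py) - HY_given_X)

# "Self-information" in the textbook sense: I[X;X] = H[X] - H[X|X] = H[X],
# since X determines itself and so H[X|X] = 0.  Here H[X] = 1 bit.
print(H(px))
```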

**: I realize that I *may* still
have enough of an online reputation that by posting this, others will fix the
article and keep it fixed.

**Please comment on the article here:** **Three-Toed Sloth**