Saturday 30 January 2016

Kolmogorov, Shannon and ... Shakespeare: different concepts of information

(Tuesday, January 18, 2011, 7:57 PM)



In algorithmic information theory (a subfield of computer science), the Kolmogorov complexity of an object, such as a piece of text, is a measure of the computational resources needed to specify that object. It is named after the Russian mathematician Andrey Kolmogorov.

Kolmogorov complexity is also known as descriptive complexity, Kolmogorov-Chaitin complexity, stochastic complexity, algorithmic entropy, or program-size complexity.

For example, consider the following two strings of length 64, each containing only lowercase letters, numbers, and spaces:

abababababababababababababababababababababababababababababababab

4c1j5b2p0cv4w1x8rx2y39umgw5q85s7uraqbjfdppa0q7nieieqe9noc4cvafzf

The first string has a short English-language description, namely "ab 32 times", which consists of 11 characters. The second one has no obvious simple description (using the same character set) other than writing down the string itself, which has 64 characters.
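As a side note, the intuition can be made roughly concrete in code. The following is a minimal sketch in Python, not a formal definition (Kolmogorov complexity is properly defined with respect to a fixed universal machine); it simply compares the length of a short program that generates the regular string with the length of the random-looking string itself:

regular = "ab" * 32  # a 9-character expression generates all 64 characters
random_like = "4c1j5b2p0cv4w1x8rx2y39umgw5q85s7uraqbjfdppa0q7nieieqe9noc4cvafzf"
print(len(regular), len(random_like))      # 64 64
print(len('"ab" * 32'), len(random_like))  # 9 vs 64: a rough proxy for descriptive complexity

The short expression plays the same role as the 11-character English description "ab 32 times"; no comparably short generator is known for the second string.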

Historically prior to Kolmogorov complexity, but similar to it in many respects, is the notion of Shannon information (see "Shannon Information and Kolmogorov Complexity" by Peter Grünwald and Paul Vitányi, July 22, 2010, PDF at homepages.cwi.nl).

It is critical not to confuse two very different concepts of information:

A. Information-carrying capacity (aka "Shannon information"): the maximum capacity of an information storage medium. (Examples: [IT] the storage capacity of an HDD, normally expressed in bits and/or bytes; [Cell] the storage capacity of the cell's DNA, expressed in number of bases.)

B. Specified information, also called functional information, coded information, or information content. Its theoretical maximum (limit) coincides with A, but in practice it is always lower, mainly because of redundancy and only partial utilization of the available memory. (Examples: [IT] the portion of an HDD's capacity actually occupied by programs, databases, and other data; [Cell] the actual amount of information carried by the cell's DNA, in terms of the number of genes and the total number of bases of the encoded genes.) A rough numerical illustration of the A/B distinction follows below.
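Here is a small Python sketch of that distinction (an illustration only: it uses a zeroth-order, single-character entropy estimate as a crude stand-in for actual information content, and assumes a 27-symbol alphabet of 26 letters plus space):

import math
from collections import Counter

def capacity_bits(text, alphabet_size=27):
    # A: every character could in principle carry log2(alphabet_size) bits
    return len(text) * math.log2(alphabet_size)

def estimated_content_bits(text):
    # B (crude estimate): empirical per-character entropy times the length
    counts = Counter(text)
    n = len(text)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return n * entropy

sample = "to be or not to be that is the question"
print(round(capacity_bits(sample)))           # A for this sample: len(sample) * log2(27) bits
print(round(estimated_content_bits(sample)))  # noticeably lower, because the letter frequencies are uneven

Real estimates of the per-character entropy of English (Shannon's own experiments put it around 1 bit per character, against a raw capacity of roughly 4.7 bits per character) make the gap between A and B even wider.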

The information-carrying capacity ("Shannon information") of a "book capable of listing a random string of 10 million random alphanumeric characters" is, obviously, 10 million characters, whereas the actual information content depends essentially on the redundancy of the code, for instance English, or more specifically Shakespearean English. This means that the information content of a meaningful text is (considerably) lower than the raw number of characters used. The intuitive counter-proof is that such a text can be compressed and reconstituted (typically with an algorithm that exploits the statistical redundancy properties of the language), whereas a purely random text of the same size (10 million characters) cannot be compressed to any significant degree.
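This compressibility argument can be checked directly. Below is a short Python sketch (illustrative only: the repeated Shakespeare-flavoured phrase and the zlib compressor are just convenient stand-ins for "redundant English text" and "a compression algorithm"):

import random
import string
import zlib

# Redundant, English-like text compresses dramatically.
english_like = ("methinks it is like a weasel " * 400).encode()

# Random text of the same length is essentially incompressible.
rng = random.Random(0)
alphabet = string.ascii_lowercase + string.digits + " "
noise = "".join(rng.choice(alphabet) for _ in range(len(english_like))).encode()

print(len(english_like), len(zlib.compress(english_like, 9)))  # large reduction
print(len(noise), len(zlib.compress(noise, 9)))                # almost no reduction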

Let's now suppose, for the sake of argument, that Shakespeare's The Tragedy of Hamlet, Prince of Denmark, contains about 10 million characters.

While it is certainly true that the "information content" of Hamlet's text lies somewhere between that of a text of the same size fully determined by some (more or less complex) algorithm and that of a fully random text of the same size (whose information content equals the information-carrying capacity), ONLY if the text obeys some code (in Hamlet's case, the rules of the English language) does it carry not just information, but functional information, which is the same as meaningful information.


For instance, the string of characters ...

METHINKS␢IT␢IS␢LIKE␢A␢WEASEL (where: ␢ = blank space)

... is meaningful in English (William Shakespeare, Hamlet, Act 3, Scene 2) ...

... but meaningless in another language, for instance in Italian.

Likewise, with reference to the "biological context", we can say that the "meaning" of a meaningful sequence of DNA (a gene) is the protein (or other cellular structure and/or operation) that it specifies.
