After four people nearly die from poisoning, Don learns someone is tampering with nonprescription drugs. When the perpetrator publishes a manifesto in a local paper, Don is led to believe the person may be a former employee of the drug company that manufactured the drug.

In this episode, Charlie uses information theory and information entropy to help understand the manifesto, and probabilistic graph theory in an attempt to determine where the perpetrator is likely to be found.

Charlie describes Information Theory as the branch of mathematics that formalizes the study of how to ``get a message from point A to point B'' and explains that this theory provides a good working definition of what the information content of a message is. Just as there are different ways to hear the same song, for example, this definition should not depend on the form of the message.

A very common mistake is to believe that information theory is about what messages mean, that is, to read ``content'' here as *quality* rather than *quantity*.

Indeed, tools like *information entropy* allow mathematicians to give precise meaning to, and answer, questions like ``how much information is present in this message?'', just as a scale tells you how much something weighs without saying anything about whether it is steel or wood, for example.

Information theory is generally considered to have been founded in 1948 by Claude Shannon in his seminal work, A Mathematical Theory of Communication. The central paradigm of classical information theory is the engineering problem of the transmission of information over a noisy channel.

Applications of fundamental topics of information theory include lossless data compression (e.g. ZIP files), lossy data compression (e.g. MP3s), and channel coding (e.g. for DSL lines). The field is at the crossroads of mathematics, statistics, computer science, physics, neurobiology, and electrical engineering. Its impact has been crucial to the success of the Voyager missions to deep space, the invention of the CD, the feasibility of mobile phones, the development of the Internet, the study of linguistics and of human perception, the understanding of black holes, and numerous other fields (Wikipedia).

Suppose you ask someone you just met to tell you their address. From a mathematical point of view, this experience is similar to picking an address at random out of the yellow pages, say. Indeed, while you can't tell a priori what the answer is going to be, different locations are more or less likely depending on the context - the languages they speak, for example, or whether they go to your college.

Another important concept in information theory is that of *noise*, which generalizes the intuitive definition of that word. For example, a very loud car passing by could make you mishear the person you're talking to and alter the message that you get, just as electromagnetic interference could modify the signal your cell phone is picking up.
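To make the idea of noise concrete, here is a minimal Python sketch (with a made-up flip probability) of a classic model of a noisy channel, the binary symmetric channel, in which each transmitted bit is independently flipped with some small probability:

```python
import random

def noisy_channel(bits, flip_prob=0.1):
    """Transmit a sequence of bits, flipping each one
    independently with probability flip_prob (the 'noise')."""
    return [b ^ 1 if random.random() < flip_prob else b for b in bits]

random.seed(0)
sent = [1, 0, 1, 1, 0, 0, 1, 0] * 1000
received = noisy_channel(sent, flip_prob=0.1)

# Fraction of bits altered by noise -- close to flip_prob on average.
errors = sum(s != r for s, r in zip(sent, received)) / len(sent)
print(round(errors, 2))
```

The message that arrives is thus itself random: even knowing exactly what was sent, you cannot say for certain what will be received.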

We are thus led to model the process of ``getting a message from A to B'' as a random process. In other words, receiving a message becomes *sampling a random variable*.

Suppose now that we are given a question X whose possible answers are x_{1}, ..., x_{n}, and assume that the probability that the answer you actually get is x_{i} is p_{i}. Thus, by definition, P(X = x_{i}) = p_{i}.
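This setup is easy to simulate. The following Python sketch (with hypothetical answers and probabilities) treats receiving an answer to X as sampling a random variable, and checks that the empirical frequency of x_{1} approaches p_{1}:

```python
import random

# A hypothetical question X with answers x_1, ..., x_4 and probabilities p_i.
answers = ["x1", "x2", "x3", "x4"]
probs = [0.5, 0.25, 0.15, 0.10]  # the p_i must sum to 1

# Receiving one message = sampling the random variable X once.
message = random.choices(answers, weights=probs, k=1)[0]
print(message)

# Over many samples, the empirical frequency of x_1 approaches p_1 = 0.5.
samples = random.choices(answers, weights=probs, k=100_000)
freq_x1 = samples.count("x1") / len(samples)
print(round(freq_x1, 2))
```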

We would like to assign to this question X a numerical value - i.e. a number - H(X), called its *entropy*, measuring ``how much information you would get if you knew the answer to X''. How to define such a function is not obvious at first sight. The following activity explores the first properties one would expect H(X) to satisfy if it indeed existed.

- The result of tossing a fair coin
- The result of tossing a coin that falls on heads 99% of the time

Now, suppose you have two questions X and Y: how much information is contained in knowing the answer to both X and Y? We know the information given by X is H(X) and that given by Y is H(Y), so you might be tempted to think the answer is H(X) + H(Y). Indeed, if you're given one pound of apples and one pound of oranges, you do have two pounds of fruit. One has to be more careful here, though. If X and Y are not independent (if X = Y, for example), then the above additive formula can't hold. In other words, knowing one person's name and their name doesn't count as knowing two things about that person!

The concept of entropy was originally a thermodynamic one which has then been adapted to other fields of study, including information theory, psychodynamics, thermoeconomics, and evolution.
The statistical definition of entropy, which describes it as the number of possible microscopic configurations of a macroscopic system, is the more fundamental definition and all other definitions and properties follow from it, including the definition used here.

Other mathematical considerations show that there *is* a function H that satisfies all these properties, and its definition is the following:

H(X) = - ( p_{1} log(p_{1}) + ... + p_{n} log(p_{n}) )
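As a sanity check, here is a short Python sketch of the entropy formula H(X) = -sum of p_{i} log(p_{i}) (written with the base-2 logarithm; any base works, and base 2 is the convention adopted later in this text), together with the additivity property for independent questions discussed above:

```python
from math import log2

def entropy(probs):
    """Shannon entropy H(X) = -sum_i p_i * log2(p_i).
    Terms with p_i = 0 contribute 0 by convention."""
    return -sum(p * log2(p) for p in probs if p > 0)

# A fair coin carries more information than a heavily biased one.
print(entropy([0.5, 0.5]))    # 1.0
print(entropy([0.99, 0.01]))  # about 0.08

# For independent X and Y, H(X, Y) = H(X) + H(Y).
px = [0.5, 0.5]
py = [0.25, 0.75]
joint = [p * q for p in px for q in py]
print(abs(entropy(joint) - (entropy(px) + entropy(py))) < 1e-9)  # True
```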

The following activity should help you convince yourself that this is a reasonable definition.

- What is H(1)? H(0)? What about H(0.99)? Is this compatible with what we were expecting in activity 1?
- When is H(p) maximal? Try to solve for p in H'(p) = 0
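If you want to check your answers numerically, here is a small Python sketch of the binary entropy function H(p) of a coin that lands on heads with probability p:

```python
from math import log2

def H(p):
    """Binary entropy of a coin that lands on heads with probability p."""
    if p in (0.0, 1.0):
        return 0.0  # a certain outcome carries no information
    return -p * log2(p) - (1 - p) * log2(1 - p)

print(H(1.0))   # 0.0 -- the answer is already known
print(H(0.99))  # about 0.08
print(H(0.5))   # 1.0 -- the fair coin is maximally unpredictable

# H(p) is maximal at p = 1/2, where H'(p) = log2((1 - p) / p) = 0.
ps = [i / 1000 for i in range(1, 1000)]
best = max(ps, key=H)
print(best)  # 0.5
```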

The following activity is more challenging and involves calculating the entropy of a non-trivial random process:

- What is the entropy H(X) of the series of set winners?
- What is the entropy H(Y) of the winner variable?
- The probabilities of winning a set are now different. Let p denote the probability that player A wins a set. What is the probability p that maximizes the entropies H(X) and H(Y)?
- What is the probability p that minimizes H(X) and H(Y)?
- What are the entropies H(X) and H(Y) if player A wins every set?

Now that we have defined the entropy of a random variable and said that it measures the "information content" of a message, the only thing missing is a *unit*.

So what unit do we measure entropy in? From now on, we'll assume the logarithm in the definition of entropy is taken to be the base-2 logarithm log_{2}, defined by log_{2}(x) = log(x) / log(2).

- As a warm-up, what is log_{2}(2^{n})?
- Suppose we're interested in sending one character among 2^{n} possible ones, all equally likely. What is the entropy of such a message?
- Suppose now that you are given a panel of m toggle switches lined up in a row. We would like to assign to each configuration of the switches one of the values the message can take. How many messages can you encode with m switches? How many switches does it take to encode 3 possible messages? 54 messages?
- Suppose now we are sending a string of 0's, each 0 being one message. What is the entropy of such a message?
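The switch-counting questions above can be checked with a few lines of Python (the helper name switches_needed is ours):

```python
from math import ceil, isclose, log2

def switches_needed(n_messages):
    """Minimum number of on/off switches needed to give each of
    n_messages its own configuration: the smallest m with 2**m >= n."""
    return ceil(log2(n_messages))

# m switches have 2**m configurations, so they encode 2**m messages.
print(2 ** 5)               # 32 messages with 5 switches
print(switches_needed(3))   # 2
print(switches_needed(54))  # 6, since 2**5 = 32 < 54 <= 64 = 2**6

# The entropy of one character among 2**n equally likely ones is n.
n = 5
probs = [1 / 2**n] * 2**n
H = -sum(p * log2(p) for p in probs)
print(isclose(H, n))  # True
```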

Knowing whether a switch is in the on or off position is the most elementary piece of knowledge -- as is any answer to a yes or no question. It can be encoded with an alphabet of two characters, 0 (for off) and 1 (for on) for example. We call such a piece of information a *bit*.

We have seen that it takes n bits, that is to say n switches if you will, to encode the value of a message that can take 2^{n} equally likely values. We have also seen that the entropy of such a message is exactly n.

On the other hand, if we already know the message, there is no need for any bits to encode the information, and the entropy is also 0.

This suggests that entropy might be a measure of how many bits are needed to encode the value of a message. Further investigations show that this is indeed true, as long as one considers this number of bits as a theoretical lower bound -- meaning you would need at least that many bits.

In other words, entropy measures the content of a message in *bits*.

The following activity introduces the notion of *data compression* and shows how one can get closer to the theoretical bound provided by the entropy.

Suppose we are investigating sending a one-character message that can take values amongst a, b, c or d with probabilities given in the following table:

x_{i} | a | b | c | d
p_{i} | 2/3 | 4/27 | 4/27 | 1/27

- What is the entropy of such a message?
- How many bits are needed to give each of the values a, b, c and d its own combination of bits?
- Suppose we use the following association:

x_{i} | a | b | c | d
Bits | 00 | 01 | 10 | 11

We want to encode the following sequences of messages: aaabaccbaa and abacaaaaaa. How many bits are needed to encode those sequences? What are the corresponding bit sequences?

- Suppose now we use a different association. We want the most frequent messages to be encoded with the fewest bits, while the less frequent ones may use more. Consider the following association:

x_{i} | a | b | c | d
Bits | 0 | 10 | 110 | 111

What are now the bit sequences corresponding to our previous two sequences (aaabaccbaa and abacaaaaaa)? How many bits are needed now? What is the average number of bits per message, that is to say per character? How does this relate to the entropy you calculated in question 1?
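Once you have worked through these questions by hand, the following Python sketch can be used to check your answers (the variable names are ours; the probabilities and codes are those of the tables above):

```python
from math import log2

probs = {"a": 2/3, "b": 4/27, "c": 4/27, "d": 1/27}
fixed = {"a": "00", "b": "01", "c": "10", "d": "11"}
prefix = {"a": "0", "b": "10", "c": "110", "d": "111"}

def encode(msg, code):
    """Concatenate the codeword of each character in msg."""
    return "".join(code[ch] for ch in msg)

for msg in ("aaabaccbaa", "abacaaaaaa"):
    f, v = encode(msg, fixed), encode(msg, prefix)
    print(msg, len(f), len(v))  # the variable-length code uses fewer bits

# Average number of bits per character under the variable-length code...
avg = sum(probs[ch] * len(prefix[ch]) for ch in probs)
print(round(avg, 3))  # about 1.519

# ...compared with the entropy, the theoretical lower bound:
H = -sum(p * log2(p) for p in probs.values())
print(round(H, 3))  # about 1.382
```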

The following decision tree might help you answer those questions.

So how does one go about constructing such trees? The example above is in fact a very simple example of what is called *data compression*, and the table above is built on the same kind of ideas as the ZIP compression technique.
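For the curious, here is a minimal Python sketch of Huffman's algorithm, one standard way of constructing such a code (and one of the ideas behind ZIP-style compression). Ties between equal probabilities may be broken differently than in the table above, but the codeword lengths come out equivalent:

```python
import heapq
from itertools import count

def huffman(probs):
    """Build a prefix code: repeatedly merge the two least likely
    nodes, prepending bit '0' to one side and '1' to the other."""
    tiebreak = count()  # keeps heap entries comparable when probabilities tie
    heap = [(p, next(tiebreak), {sym: ""}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, code1 = heapq.heappop(heap)
        p2, _, code2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in code1.items()}
        merged.update({s: "1" + c for s, c in code2.items()})
        heapq.heappush(heap, (p1 + p2, next(tiebreak), merged))
    return heap[0][2]

code = huffman({"a": 2/3, "b": 4/27, "c": 4/27, "d": 1/27})
lengths = {sym: len(bits) for sym, bits in code.items()}
print(lengths)  # a gets the shortest codeword (1 bit)
```

The least likely symbols are merged first and therefore end up deepest in the tree, which is exactly the ``frequent messages get short codewords'' idea used above.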