Information Theory

Shannon’s mathematical framework for quantifying information, the limits of data compression, and the maximum rate of reliable communication over noisy channels. Published in 1948, it underpins compression (ZIP, MP3, H.264), error correction (WiFi, 5G, QR codes), cryptography, and even machine learning (cross-entropy loss).

Why It Matters

Information theory answers fundamental questions: How much can you compress a file? How fast can you send data over a noisy channel? When does a code provide enough error correction? These limits are absolute — no algorithm can beat them. Understanding entropy and channel capacity gives you the theoretical floor and ceiling for any data system.

Information and Surprise

The information content of an event is inversely related to its probability. Rare events carry more information:

I(x) = -log₂(P(x))   bits

P(x) = 1/2  →  I(x) = 1 bit    (coin flip)
P(x) = 1/8  →  I(x) = 3 bits   (picking one of 8 equally likely items)
P(x) = 1    →  I(x) = 0 bits   (certain event — no surprise)
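These values follow directly from the formula (a minimal sketch; `surprisal` is a name chosen here, not a standard library function):

```python
import math

def surprisal(p):
    """Information content of an event with probability p, in bits."""
    return -math.log2(p)

print(surprisal(1/2))   # 1.0  (coin flip)
print(surprisal(1/8))   # 3.0  (one of 8 equally likely items)
print(surprisal(1.0))   # -0.0 (i.e., zero bits: no surprise)
```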

Entropy

Entropy H(X) is the average information per symbol in a source — the expected surprise:

H(X) = -Σ P(x) · log₂(P(x))    bits per symbol

import numpy as np
 
def entropy(probs):
    """Shannon entropy in bits."""
    probs = np.array([p for p in probs if p > 0])
    return -np.sum(probs * np.log2(probs))
 
# Fair coin: H = 1 bit (maximum uncertainty for 2 outcomes)
print(entropy([0.5, 0.5]))      # 1.0
 
# Biased coin (90/10): H = 0.47 bits (more predictable)
print(entropy([0.9, 0.1]))      # 0.469
 
# Uniform over 256 symbols: H = 8 bits (like a random byte)
print(entropy([1/256] * 256))   # 8.0

Maximum entropy: uniform distribution (all outcomes equally likely). Minimum entropy: one certain outcome (zero bits). This is why random data is incompressible — it’s already at maximum entropy.

Encoding and Compression

Shannon’s source coding theorem: a source with entropy H bits/symbol cannot be compressed below H bits/symbol on average. Entropy is the compression limit.
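A quick empirical check of the limit, using Python's standard zlib (a sketch; exact compressed sizes vary by zlib version and compression level):

```python
import os
import zlib

n = 100_000
low_entropy = b"a" * n         # one symbol: entropy near 0 bits/byte
high_entropy = os.urandom(n)   # uniform random bytes: ~8 bits/byte

# The repetitive stream shrinks dramatically; the random stream cannot.
print(len(zlib.compress(low_entropy)))   # a few hundred bytes at most
print(len(zlib.compress(high_entropy)))  # slightly over 100,000 bytes: no gain
```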

Huffman Coding Example

Assign shorter codes to more probable symbols:

from collections import Counter
import heapq
 
def huffman(text):
    """Build Huffman codes for a string."""
    freq = Counter(text)
    heap = [[count, [char, ""]] for char, count in freq.items()]
    heapq.heapify(heap)
 
    while len(heap) > 1:
        lo = heapq.heappop(heap)
        hi = heapq.heappop(heap)
        for pair in lo[1:]: pair[1] = '0' + pair[1]
        for pair in hi[1:]: pair[1] = '1' + pair[1]
        heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
 
    return {char: code for char, code in heap[0][1:]}
 
text = "aaabbc"
codes = huffman(text)
print(codes)  # e.g., {'a': '0', 'b': '10', 'c': '11'}
 
# Compression ratio
original_bits = len(text) * 8  # ASCII
compressed_bits = sum(len(codes[c]) for c in text)
print(f"Original: {original_bits} bits, Compressed: {compressed_bits} bits")
print(f"Entropy: {entropy([3/6, 2/6, 1/6]):.2f} bits/symbol")

English text has an entropy rate of roughly 1.0-1.5 bits per character (vs 8 bits for ASCII), once statistical dependence between characters is taken into account. That's why text compresses so well: it's highly redundant.
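A unigram estimate of text entropy is easy to compute (a sketch; note that letter frequencies alone give roughly 4 bits/char for English, and the ~1.0-1.5 figure additionally exploits context between characters):

```python
from collections import Counter
import math

def char_entropy(text):
    """Empirical entropy of the character distribution, in bits/char."""
    counts = Counter(text)
    n = len(text)
    return -sum(c/n * math.log2(c/n) for c in counts.values())

sample = "the quick brown fox jumps over the lazy dog"
print(f"{char_entropy(sample):.2f} bits/char")  # roughly 4.4 (unigram only)
```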

Channel Capacity

Shannon-Hartley theorem: the maximum rate of error-free communication over a noisy channel:

C = B · log₂(1 + S/N)    bits per second

B = channel bandwidth (Hz)
S/N = signal-to-noise ratio (linear, not dB)

Example: WiFi channel with 20 MHz bandwidth, SNR = 30 dB (1000 linear):

B = 20e6       # 20 MHz
SNR_dB = 30
SNR = 10**(SNR_dB / 10)   # 1000
 
C = B * np.log2(1 + SNR)
print(f"Channel capacity: {C/1e6:.0f} Mbps")  # ~200 Mbps

This is the absolute maximum — no coding scheme can exceed it. Modern codes (LDPC, Turbo, Polar) approach within 0.1 dB of the Shannon limit.

Error Correction

Redundancy enables reliable communication over noisy channels:

| Concept | Meaning |
| --- | --- |
| Hamming distance | Number of bit positions where two codewords differ |
| Minimum distance d | Can detect d-1 errors, correct ⌊(d-1)/2⌋ errors |
| Rate R = k/n | k information bits encoded into n bits (overhead = n-k) |

| Code | Rate | Error Correction | Use |
| --- | --- | --- | --- |
| Parity bit | (n-1)/n | Detect 1 error | Simple checks |
| Hamming(7,4) | 4/7 | Correct 1 error | Classic, educational |
| Reed-Solomon | Variable | Burst errors | QR codes, CDs, space |
| LDPC | Near capacity | High performance | WiFi 6, 5G, DVB |
| Turbo codes | Near capacity | High performance | 3G/4G, deep space |
| Polar codes | Capacity-achieving | Theoretical optimality | 5G control channel |
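Hamming(7,4) is small enough to implement in a few lines (a sketch using the classic position-based construction, with parity bits at positions 1, 2, and 4 so the syndrome spells out the error position in binary):

```python
def hamming74_encode(data4):
    """Encode 4 data bits into a 7-bit Hamming codeword.
    Bit positions 1..7; parity bits sit at positions 1, 2, 4."""
    c = [0] * 8                   # index 0 unused for 1-based positions
    c[3], c[5], c[6], c[7] = data4
    c[1] = c[3] ^ c[5] ^ c[7]     # covers positions with bit 0 set
    c[2] = c[3] ^ c[6] ^ c[7]     # covers positions with bit 1 set
    c[4] = c[5] ^ c[6] ^ c[7]     # covers positions with bit 2 set
    return c[1:]

def hamming74_decode(code7):
    """Correct up to one bit error and return the 4 data bits."""
    c = [0] + list(code7)
    s1 = c[1] ^ c[3] ^ c[5] ^ c[7]
    s2 = c[2] ^ c[3] ^ c[6] ^ c[7]
    s4 = c[4] ^ c[5] ^ c[6] ^ c[7]
    pos = s1 + 2 * s2 + 4 * s4    # syndrome = error position (0 = no error)
    if pos:
        c[pos] ^= 1
    return [c[3], c[5], c[6], c[7]]

word = [1, 0, 1, 1]
code = hamming74_encode(word)
code[2] ^= 1                           # flip one bit in transit
print(hamming74_decode(code) == word)  # True: single error corrected
```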

Mutual Information and KL Divergence

Mutual information I(X;Y): how much knowing Y tells you about X:

I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)

If I(X;Y) = 0, X and Y are independent. If I(X;Y) = H(X), Y completely determines X.

KL divergence D_KL(P||Q): measures how different distribution Q is from P:

D_KL(P||Q) = Σ P(x) · log(P(x)/Q(x))

KL divergence is not symmetric (D_KL(P||Q) ≠ D_KL(Q||P)). It appears in machine learning as the connection between cross-entropy loss and maximum likelihood estimation.
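The asymmetry is easy to check numerically (a sketch with illustrative distributions; the helper assumes Q is nonzero wherever P is):

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P||Q) in bits; assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

p = [0.9, 0.1]   # biased coin
q = [0.5, 0.5]   # fair coin
print(kl_divergence(p, q))  # ~0.531 bits
print(kl_divergence(q, p))  # ~0.737 bits: not symmetric
```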

Connection to ML

Cross-entropy loss in classification = H(P_true) + D_KL(P_true || P_model). Minimizing cross-entropy = minimizing KL divergence from the true distribution. This is why information theory is foundational to modern machine learning.
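The decomposition above can be verified numerically (a sketch; `p_true` and `p_model` are illustrative distributions, not outputs of a real model):

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(P) in bits."""
    p = np.asarray([x for x in p if x > 0], float)
    return float(-np.sum(p * np.log2(p)))

def cross_entropy(p, q):
    """H(P, Q) = -Σ p log₂ q, in bits."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = p > 0
    return float(-np.sum(p[m] * np.log2(q[m])))

def kl_divergence(p, q):
    """D_KL(P||Q) in bits."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = p > 0
    return float(np.sum(p[m] * np.log2(p[m] / q[m])))

p_true  = [0.7, 0.2, 0.1]   # hypothetical true distribution
p_model = [0.5, 0.3, 0.2]   # hypothetical model prediction

lhs = cross_entropy(p_true, p_model)
rhs = entropy(p_true) + kl_divergence(p_true, p_model)
print(abs(lhs - rhs) < 1e-9)  # True: H(P,Q) = H(P) + D_KL(P||Q)
```

With a one-hot label, H(P_true) = 0, so minimizing cross-entropy is exactly minimizing the KL divergence to the model.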