Information Theory
Shannon’s mathematical framework for quantifying information, the limits of data compression, and the maximum rate of reliable communication over noisy channels. Published in 1948, it underpins compression (ZIP, MP3, H.264), error correction (WiFi, 5G, QR codes), cryptography, and even machine learning (cross-entropy loss).
Why It Matters
Information theory answers fundamental questions: How much can you compress a file? How fast can you send data over a noisy channel? When does a code provide enough error correction? These limits are absolute — no algorithm can beat them. Understanding entropy and channel capacity gives you the theoretical floor and ceiling for any data system.
Information and Surprise
The information content of an event is inversely related to its probability. Rare events carry more information:
I(x) = -log₂(P(x)) bits
P(x) = 1/2 → I(x) = 1 bit (coin flip)
P(x) = 1/8 → I(x) = 3 bits (picking one of 8 equally likely items)
P(x) = 1 → I(x) = 0 bits (certain event — no surprise)
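The same examples map directly to code; a minimal sketch (the `information` helper is just for illustration):

```python
import math

def information(p):
    """Self-information I(x) = -log2 P(x), in bits."""
    return math.log2(1 / p)

print(information(1/2))   # 1.0 (coin flip)
print(information(1/8))   # 3.0 (one of 8 equally likely items)
print(information(1.0))   # 0.0 (a certain event carries no surprise)
```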
Entropy
Entropy H(X) is the average information per symbol in a source — the expected surprise:
H(X) = -Σ P(x) · log₂(P(x)) bits per symbol
```python
import numpy as np

def entropy(probs):
    """Shannon entropy in bits."""
    probs = np.array([p for p in probs if p > 0])
    return -np.sum(probs * np.log2(probs))

# Fair coin: H = 1 bit (maximum uncertainty for 2 outcomes)
print(entropy([0.5, 0.5]))  # 1.0

# Biased coin (90/10): H = 0.47 bits (more predictable)
print(entropy([0.9, 0.1]))  # 0.469

# Uniform over 256 symbols: H = 8 bits (like a random byte)
print(entropy([1/256] * 256))  # 8.0
```

Maximum entropy: uniform distribution (all outcomes equally likely). Minimum entropy: one certain outcome (zero bits). This is why random data is incompressible — it’s already at maximum entropy.
Encoding and Compression
Shannon’s source coding theorem: a source with entropy H bits/symbol cannot be compressed below H bits/symbol on average. Entropy is the compression limit.
Huffman Coding Example
Assign shorter codes to more probable symbols:
```python
from collections import Counter
import heapq

def huffman(text):
    """Build Huffman codes for a string."""
    freq = Counter(text)
    heap = [[count, [char, ""]] for char, count in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)
        hi = heapq.heappop(heap)
        for pair in lo[1:]:
            pair[1] = '0' + pair[1]
        for pair in hi[1:]:
            pair[1] = '1' + pair[1]
        heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
    return {char: code for char, code in heap[0][1:]}

text = "aaabbc"
codes = huffman(text)
print(codes)  # e.g., {'a': '0', 'b': '10', 'c': '11'}

# Compression ratio
original_bits = len(text) * 8  # ASCII
compressed_bits = sum(len(codes[c]) for c in text)
print(f"Original: {original_bits} bits, Compressed: {compressed_bits} bits")
print(f"Entropy: {entropy([3/6, 2/6, 1/6]):.2f} bits/symbol")
```

English text has ~1.0-1.5 bits/symbol entropy (vs 8 bits for ASCII). That’s why text compresses so well — it’s highly redundant.
Channel Capacity
Shannon-Hartley theorem: the maximum rate of error-free communication over a noisy channel:
C = B · log₂(1 + S/N) bits per second
B = channel bandwidth (Hz)
S/N = signal-to-noise ratio (linear, not dB)
Example: WiFi channel with 20 MHz bandwidth, SNR = 30 dB (1000 linear):
```python
import numpy as np

B = 20e6                  # 20 MHz
SNR_dB = 30
SNR = 10**(SNR_dB / 10)   # 1000 (convert dB to linear)
C = B * np.log2(1 + SNR)
print(f"Channel capacity: {C/1e6:.0f} Mbps")  # ~200 Mbps
```

This is the absolute maximum — no coding scheme can exceed it. Modern codes (LDPC, Turbo, Polar) approach within 0.1 dB of the Shannon limit.
Error Correction
Redundancy enables reliable communication over noisy channels:
| Concept | Meaning |
|---|---|
| Hamming distance | Number of bit positions where two codewords differ |
| Minimum distance d | Can detect d-1 errors, correct ⌊(d-1)/2⌋ errors |
| Rate R = k/n | k information bits encoded into n bits (overhead = n-k) |
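The distance concepts above fit in a few lines (the `hamming_distance` helper is illustrative, not a library function):

```python
def hamming_distance(a, b):
    """Count bit positions where two equal-length codewords differ."""
    return sum(x != y for x, y in zip(a, b))

# A 3-bit repetition code has minimum distance d = 3 between its two
# codewords, so it detects up to 2 errors and corrects 1 (majority vote).
print(hamming_distance("000", "111"))  # 3
print(hamming_distance("000", "010"))  # 1
```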
| Code | Rate | Error Correction | Use |
|---|---|---|---|
| Parity bit | (n-1)/n | Detect 1 error | Simple checks |
| Hamming(7,4) | 4/7 | Correct 1 error | Classic, educational |
| Reed-Solomon | Variable | Burst errors | QR codes, CDs, space |
| LDPC | Near capacity | High performance | WiFi 6, 5G, DVB |
| Turbo codes | Near capacity | High performance | 3G/4G, deep space |
| Polar codes | Capacity-achieving | Theoretical optimality | 5G control channel |
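As a concrete instance of the Hamming(7,4) row, here is a minimal sketch (function names and bit layout are one common convention, not the only one): 4 data bits gain 3 parity bits, and the syndrome points directly at any single flipped bit.

```python
def hamming74_encode(d):
    """Encode 4 data bits into a 7-bit Hamming(7,4) codeword."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(c):
    """Correct up to one bit error, return the 4 data bits."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]  # recheck positions 1, 3, 5, 7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]  # recheck positions 2, 3, 6, 7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]  # recheck positions 4, 5, 6, 7
    syndrome = s1 + 2*s2 + 4*s3     # 1-based position of the error (0 = none)
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]

data = [1, 0, 1, 1]
codeword = hamming74_encode(data)
codeword[4] ^= 1                          # flip one bit in transit
print(hamming74_decode(codeword) == data) # True: the error was corrected
```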
Mutual Information and KL Divergence
Mutual information I(X;Y): how much knowing Y tells you about X:
I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)
If I(X;Y) = 0, X and Y are independent. If I(X;Y) = H(X), Y completely determines X.
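Both extremes show up when I(X;Y) is computed from a joint distribution table; a sketch (the `mutual_information` helper and its example tables are illustrative):

```python
import numpy as np

def mutual_information(joint):
    """I(X;Y) in bits from a joint probability table P(x, y)."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)   # marginal P(x)
    py = joint.sum(axis=0, keepdims=True)   # marginal P(y)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (px * py)[mask])))

# Y is an exact copy of a fair bit X: I(X;Y) = H(X) = 1 bit
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))   # 1.0

# X and Y independent: knowing Y tells you nothing about X
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))  # 0.0
```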
KL divergence D_KL(P||Q): measures how different distribution Q is from P:
D_KL(P||Q) = Σ P(x) · log(P(x)/Q(x))
KL divergence is not symmetric (D_KL(P||Q) ≠ D_KL(Q||P)). It appears in machine learning as the connection between cross-entropy loss and maximum likelihood estimation.
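The asymmetry is easy to see numerically; a minimal sketch (the `kl_divergence` helper is illustrative and assumes Q(x) > 0 wherever P(x) > 0):

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P||Q) in bits."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

p, q = [0.9, 0.1], [0.5, 0.5]
print(kl_divergence(p, q))  # ~0.531 bits
print(kl_divergence(q, p))  # ~0.737 bits: not symmetric
```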
Connection to ML
Cross-entropy loss in classification = H(P_true) + D_KL(P_true || P_model). Minimizing cross-entropy = minimizing KL divergence from the true distribution. This is why information theory is foundational to modern machine learning.
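The decomposition can be checked numerically; a sketch with made-up distributions (`cross_entropy` is an illustrative helper, not a framework loss function):

```python
import numpy as np

def cross_entropy(p, q):
    """H(P, Q) = -sum P(x) log2 Q(x), in bits."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(-np.sum(p[mask] * np.log2(q[mask])))

p_true  = np.array([0.9, 0.1])   # hypothetical true label distribution
p_model = np.array([0.6, 0.4])   # hypothetical model prediction

h_true = float(-np.sum(p_true * np.log2(p_true)))            # H(P_true)
kl = float(np.sum(p_true * np.log2(p_true / p_model)))       # D_KL(P_true||P_model)

# H(P, Q) = H(P) + D_KL(P||Q), so minimizing cross-entropy with H(P) fixed
# is the same as minimizing the KL divergence to the true distribution.
print(abs(cross_entropy(p_true, p_model) - (h_true + kl)) < 1e-9)  # True
```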
Related
- Big O Notation — complexity of encoding/decoding algorithms
- Binary and Hexadecimal — bit-level representation
- TLS and Encryption — information-theoretic security (one-time pad)
- Signal Processing — bandwidth and SNR determine channel capacity
- Hash Tables — hash functions distribute information uniformly
- DNS Protocol — DNS compression uses label pointers (simple coding)