Information Theory

Shannon’s mathematical framework for quantifying information, the limits of data compression, and the maximum rate of reliable communication over noisy channels. Published in 1948, it underpins compression (ZIP, MP3, H.264), error correction (WiFi, 5G, QR codes), cryptography, and even machine learning (cross-entropy loss).

Why It Matters

Information theory answers fundamental questions: How much can you compress a file? How fast can you send data over a noisy channel? When does a code provide enough error correction? These limits are absolute — no algorithm can beat them. Understanding entropy and channel capacity gives you the theoretical floor and ceiling for any data system.

Information and Surprise

The information content of an event is inversely related to its probability. Rare events carry more information:

I(x) = -log₂(P(x))   bits

P(x) = 1/2  →  I(x) = 1 bit    (coin flip)
P(x) = 1/8  →  I(x) = 3 bits   (picking one of 8 equally likely items)
P(x) = 1    →  I(x) = 0 bits   (certain event — no surprise)
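These values follow directly from the formula (a minimal sketch; `surprisal` is a name chosen here, not a standard library function):

```python
import math

def surprisal(p):
    """Information content of an event with probability p, in bits."""
    return -math.log2(p)

print(surprisal(1/2))   # 1.0  (coin flip)
print(surprisal(1/8))   # 3.0  (one of 8 equally likely items)
print(surprisal(1.0))   # -0.0 (i.e., zero bits: no surprise)
```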

Entropy

Entropy H(X) is the average information per symbol in a source — the expected surprise:

H(X) = -Σ P(x) · log₂(P(x))    bits per symbol

import numpy as np
 
def entropy(probs):
    """Shannon entropy in bits."""
    probs = np.array([p for p in probs if p > 0])
    return -np.sum(probs * np.log2(probs))
 
# Fair coin: H = 1 bit (maximum uncertainty for 2 outcomes)
print(entropy([0.5, 0.5]))      # 1.0
 
# Biased coin (90/10): H = 0.47 bits (more predictable)
print(entropy([0.9, 0.1]))      # 0.469
 
# Uniform over 256 symbols: H = 8 bits (like a random byte)
print(entropy([1/256] * 256))   # 8.0

Maximum entropy: uniform distribution (all outcomes equally likely). Minimum entropy: one certain outcome (zero bits). This is why random data is incompressible — it’s already at maximum entropy.

Encoding and Compression

Shannon’s source coding theorem: a source with entropy H bits/symbol cannot be compressed below H bits/symbol on average. Entropy is the compression limit.
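A quick empirical check of the limit, using Python's standard zlib (a sketch; exact compressed sizes vary by zlib version and compression level):

```python
import os
import zlib

n = 100_000
low_entropy = b"a" * n         # one symbol: entropy near 0 bits/byte
high_entropy = os.urandom(n)   # uniform random bytes: ~8 bits/byte

# The repetitive stream shrinks dramatically; the random stream cannot.
print(len(zlib.compress(low_entropy)))   # a few hundred bytes at most
print(len(zlib.compress(high_entropy)))  # slightly over 100,000 bytes: no gain
```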

Huffman Coding Example

Assign shorter codes to more probable symbols:

from collections import Counter
import heapq
 
def huffman(text):
    """Build Huffman codes for a string."""
    freq = Counter(text)
    heap = [[count, [char, ""]] for char, count in freq.items()]
    heapq.heapify(heap)
 
    while len(heap) > 1:
        lo = heapq.heappop(heap)
        hi = heapq.heappop(heap)
        for pair in lo[1:]: pair[1] = '0' + pair[1]
        for pair in hi[1:]: pair[1] = '1' + pair[1]
        heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
 
    return {char: code for char, code in heap[0][1:]}
 
text = "aaabbc"
codes = huffman(text)
print(codes)  # e.g., {'a': '0', 'b': '10', 'c': '11'}
 
# Compression ratio
original_bits = len(text) * 8  # ASCII
compressed_bits = sum(len(codes[c]) for c in text)
print(f"Original: {original_bits} bits, Compressed: {compressed_bits} bits")
print(f"Entropy: {entropy([3/6, 2/6, 1/6]):.2f} bits/symbol")

English text has an entropy rate of roughly 1.0-1.5 bits per character (vs 8 bits for ASCII), once statistical dependence between characters is taken into account. That's why text compresses so well: it's highly redundant.
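A unigram estimate of text entropy is easy to compute (a sketch; note that letter frequencies alone give roughly 4 bits/char for English, and the ~1.0-1.5 figure additionally exploits context between characters):

```python
from collections import Counter
import math

def char_entropy(text):
    """Empirical entropy of the character distribution, in bits/char."""
    counts = Counter(text)
    n = len(text)
    return -sum(c/n * math.log2(c/n) for c in counts.values())

sample = "the quick brown fox jumps over the lazy dog"
print(f"{char_entropy(sample):.2f} bits/char")  # roughly 4.4 (unigram only)
```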

Channel Capacity

Shannon-Hartley theorem: the maximum rate of error-free communication over a noisy channel:

C = B · log₂(1 + S/N)    bits per second

B = channel bandwidth (Hz)
S/N = signal-to-noise ratio (linear, not dB)

Example: WiFi channel with 20 MHz bandwidth, SNR = 30 dB (1000 linear):

B = 20e6       # 20 MHz
SNR_dB = 30
SNR = 10**(SNR_dB / 10)   # 1000
 
C = B * np.log2(1 + SNR)
print(f"Channel capacity: {C/1e6:.0f} Mbps")  # ~200 Mbps

This is the absolute maximum — no coding scheme can exceed it. Modern codes (LDPC, Turbo, Polar) approach within 0.1 dB of the Shannon limit.

Error Correction

Redundancy enables reliable communication over noisy channels:

| Concept | Meaning |
| --- | --- |
| Hamming distance | Number of bit positions where two codewords differ |
| Minimum distance d | Can detect d-1 errors, correct ⌊(d-1)/2⌋ errors |
| Rate R = k/n | k information bits encoded into n bits (overhead = n-k) |

| Code | Rate | Error Correction | Use |
| --- | --- | --- | --- |
| Parity bit | (n-1)/n | Detect 1 error | Simple checks |
| Hamming(7,4) | 4/7 | Correct 1 error | Classic, educational |
| Reed-Solomon | Variable | Burst errors | QR codes, CDs, space |
| LDPC | Near capacity | High performance | WiFi 6, 5G, DVB |
| Turbo codes | Near capacity | High performance | 3G/4G, deep space |
| Polar codes | Capacity-achieving | Theoretical optimality | 5G control channel |
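Hamming(7,4) is small enough to implement in a few lines (a sketch using the classic position-based construction, with parity bits at positions 1, 2, and 4 so the syndrome spells out the error position in binary):

```python
def hamming74_encode(data4):
    """Encode 4 data bits into a 7-bit Hamming codeword.
    Bit positions 1..7; parity bits sit at positions 1, 2, 4."""
    c = [0] * 8                   # index 0 unused for 1-based positions
    c[3], c[5], c[6], c[7] = data4
    c[1] = c[3] ^ c[5] ^ c[7]     # covers positions with bit 0 set
    c[2] = c[3] ^ c[6] ^ c[7]     # covers positions with bit 1 set
    c[4] = c[5] ^ c[6] ^ c[7]     # covers positions with bit 2 set
    return c[1:]

def hamming74_decode(code7):
    """Correct up to one bit error and return the 4 data bits."""
    c = [0] + list(code7)
    s1 = c[1] ^ c[3] ^ c[5] ^ c[7]
    s2 = c[2] ^ c[3] ^ c[6] ^ c[7]
    s4 = c[4] ^ c[5] ^ c[6] ^ c[7]
    pos = s1 + 2 * s2 + 4 * s4    # syndrome = error position (0 = no error)
    if pos:
        c[pos] ^= 1
    return [c[3], c[5], c[6], c[7]]

word = [1, 0, 1, 1]
code = hamming74_encode(word)
code[2] ^= 1                           # flip one bit in transit
print(hamming74_decode(code) == word)  # True: single error corrected
```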

Mutual Information and KL Divergence

Mutual information I(X;Y): how much knowing Y tells you about X:

I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)

If I(X;Y) = 0, X and Y are independent. If I(X;Y) = H(X), Y completely determines X.

KL divergence D_KL(P||Q): measures how different distribution Q is from P:

D_KL(P||Q) = Σ P(x) · log(P(x)/Q(x))

KL divergence is not symmetric (D_KL(P||Q) ≠ D_KL(Q||P)). It appears in machine learning as the connection between cross-entropy loss and maximum likelihood estimation.
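The asymmetry is easy to check numerically (a sketch with illustrative distributions; the helper assumes Q is nonzero wherever P is):

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P||Q) in bits; assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

p = [0.9, 0.1]   # biased coin
q = [0.5, 0.5]   # fair coin
print(kl_divergence(p, q))  # ~0.531 bits
print(kl_divergence(q, p))  # ~0.737 bits: not symmetric
```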

Connection to ML

Cross-entropy loss in classification = H(P_true) + D_KL(P_true || P_model). Minimizing cross-entropy = minimizing KL divergence from the true distribution. This is why information theory is foundational to modern machine learning.
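The decomposition above can be verified numerically (a sketch; `p_true` and `p_model` are illustrative distributions, not outputs of a real model):

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(P) in bits."""
    p = np.asarray([x for x in p if x > 0], float)
    return float(-np.sum(p * np.log2(p)))

def cross_entropy(p, q):
    """H(P, Q) = -Σ p log₂ q, in bits."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = p > 0
    return float(-np.sum(p[m] * np.log2(q[m])))

def kl_divergence(p, q):
    """D_KL(P||Q) in bits."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = p > 0
    return float(np.sum(p[m] * np.log2(p[m] / q[m])))

p_true  = [0.7, 0.2, 0.1]   # hypothetical true distribution
p_model = [0.5, 0.3, 0.2]   # hypothetical model prediction

lhs = cross_entropy(p_true, p_model)
rhs = entropy(p_true) + kl_divergence(p_true, p_model)
print(abs(lhs - rhs) < 1e-9)  # True: H(P,Q) = H(P) + D_KL(P||Q)
```

With a one-hot label, H(P_true) = 0, so minimizing cross-entropy is exactly minimizing the KL divergence to the model.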