File Systems

A file system organizes data on a storage device into named files and directories. It maps human-readable paths to blocks of data on disk, tracking metadata like permissions, timestamps, and ownership.

Why It Matters

Nearly every program reads and writes files. Understanding how file systems work explains why fsync matters for databases, why deleting a 1GB file returns almost immediately, why hard links exist, and how journaling keeps the filesystem consistent after a power failure.

Inodes and Directory Entries

An inode stores all metadata about a file except its name:

Inode 42:
  type: regular file
  size: 8192 bytes
  mode: 0644 (rw-r--r--)
  uid/gid: 1000/1000
  timestamps: atime, mtime, ctime
  link count: 1
  data blocks: [100, 101] (or extents)

A directory is a file mapping names → inode numbers:

Directory inode 10:
  "hello.txt" → inode 42
  "notes.md"  → inode 57
  ".."        → inode 2

This is why renaming a file within the same filesystem is near-instant: only directory entries change, and the data blocks are never touched. A hard link is a second directory entry pointing to the same inode, which is why the inode keeps a link count.
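The name/inode split can be observed from user space. The following is a minimal sketch, assuming a POSIX system that supports hard links; the file names and the demo_links helper are made up for illustration.

```python
import os
import tempfile


def demo_links(workdir):
    """Create a file, hard-link it, rename it, and report what the inode does."""
    original = os.path.join(workdir, "hello.txt")
    alias = os.path.join(workdir, "alias.txt")
    renamed = os.path.join(workdir, "renamed.txt")

    with open(original, "w") as f:
        f.write("hello\n")
    inode_before = os.stat(original).st_ino

    # A hard link adds a second directory entry for the same inode.
    os.link(original, alias)
    same_inode = os.stat(alias).st_ino == inode_before
    links_after_link = os.stat(original).st_nlink

    # A rename rewrites the directory entry; the inode number is unchanged.
    os.rename(original, renamed)
    inode_kept_by_rename = os.stat(renamed).st_ino == inode_before

    # Removing one name only decrements the link count.
    os.unlink(renamed)
    links_after_unlink = os.stat(alias).st_nlink
    with open(alias) as f:
        content_via_alias = f.read()

    return {
        "same_inode": same_inode,
        "links_after_link": links_after_link,
        "inode_kept_by_rename": inode_kept_by_rename,
        "links_after_unlink": links_after_unlink,
        "content_via_alias": content_via_alias,
    }


with tempfile.TemporaryDirectory() as d:
    result = demo_links(d)
    print(result)
```

The data stays readable through alias.txt after the other name is removed, because the inode is only freed once its link count reaches zero.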

ext4 Disk Layout

┌────────────┬──────────────┬──────────────┬─────────────┐
│ Superblock │ Block Group 0│ Block Group 1│    ...      │
│ (metadata) │              │              │             │
└────────────┴──────────────┴──────────────┴─────────────┘

Block Group:
┌──────┬──────┬───────────┬─────────────┐
│Bitmap│Bitmap│ Inode     │ Data Blocks │
│(block│(inode│ Table     │             │
│ alloc│alloc)│           │             │
└──────┴──────┴───────────┴─────────────┘
  • Superblock: filesystem-wide metadata (block size, total blocks, free count); the primary copy sits near the start of the disk, with backup copies in some block groups
  • Block groups: divide disk into manageable sections with local allocation
  • Extents: contiguous block ranges; ext4 stores "blocks 100-199" as one extent instead of 100 individual pointers
  • Block size: typically 4KB (matches the page size on x86)
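To make the extent idea concrete, here is a small sketch of how a file-relative block number could be translated to an on-disk block. The Extent class and logical_to_physical function are hypothetical names for illustration; the real ext4 on-disk extent tree is more involved.

```python
from bisect import bisect_right
from dataclasses import dataclass

BLOCK_SIZE = 4096  # assumed 4KB blocks


@dataclass(frozen=True)
class Extent:
    logical_start: int   # first file-relative block covered by this extent
    physical_start: int  # first on-disk block of the contiguous run
    length: int          # number of blocks in the run


def logical_to_physical(extents, logical_block):
    """Return the on-disk block for a file-relative block, or None for a hole.

    `extents` must be sorted by logical_start and must not overlap.
    """
    starts = [e.logical_start for e in extents]
    i = bisect_right(starts, logical_block) - 1
    if i < 0:
        return None
    extent = extents[i]
    offset = logical_block - extent.logical_start
    if offset >= extent.length:
        return None  # falls past the end of the nearest extent
    return extent.physical_start + offset


# Two extents describe 120 blocks with only two records.
extents = [Extent(0, 100, 100), Extent(100, 500, 20)]

byte_offset = 410_000
block = byte_offset // BLOCK_SIZE  # file-relative block containing this byte
print(block, logical_to_physical(extents, block))
```

With individual block pointers the same file would need 120 entries; with extents the lookup is a binary search over two.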

Journaling

Problem: a crash mid-write can leave the filesystem inconsistent (allocated block but no inode pointing to it, or vice versa).

Solution: write-ahead logging. Changes go to a journal first, then to their actual locations.

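1. Write data blocks to their final location (data=ordered, the ext4 default)
2. Write the metadata changes to the journal
3. Write a commit record to the journal
4. Checkpoint: copy the metadata to its final location, then free the journal space
5. Crash before step 3 → the incomplete transaction is discarded on mount; crash after step 3 → the journal is replayed on mount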

ext4 journals metadata by default. Full data journaling (data=journal) is safer but slower. Most databases (PostgreSQL, SQLite) implement their own journaling/WAL on top.
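The recovery rule behind write-ahead logging can be sketched in a few lines: only transactions that reached a commit record are applied, and anything else is thrown away. This is a toy model, not ext4's journal format; the record layout and the recover function are assumptions made for illustration.

```python
def recover(journal, disk):
    """Replay committed transactions from `journal` onto `disk` (a dict).

    Returns the ids of transactions that never committed and were discarded.
    """
    pending = {}  # transaction id -> list of (block, value) updates
    for record in journal:
        kind, txid = record[0], record[1]
        if kind == "write":
            _, _, block, value = record
            pending.setdefault(txid, []).append((block, value))
        elif kind == "commit":
            # The commit record makes the whole transaction valid at once.
            for block, value in pending.pop(txid, []):
                disk[block] = value
    # Whatever is left has no commit record, so it is ignored.
    return sorted(pending)


journal = [
    ("write", 1, "inode 42", "size=8192"),
    ("write", 1, "block bitmap", "blocks 100-101 in use"),
    ("commit", 1),
    ("write", 2, "inode 57", "size=4096"),
    # simulated crash: transaction 2 never wrote its commit record
]
disk = {"inode 42": "size=0", "inode 57": "size=0"}

discarded = recover(journal, disk)
print(disk, discarded)
```

Replay is idempotent in this model: running recover a second time leaves the disk in the same state, which matters because a crash can also happen during recovery.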

VFS (Virtual Filesystem)

Linux supports many filesystems through a common abstraction layer:

User:      open("file.txt")
             ↓
Kernel:    VFS (virtual filesystem switch)
             ↓ dispatch based on mount point
           ext4 / xfs / btrfs / nfs / tmpfs / procfs
             ↓
           Block device / network / memory

Everything uses the same syscall interface. open/read/write/close work identically whether the file is on ext4, NFS, or /proc.
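The dispatch step can be illustrated with a toy model: a mount table maps path prefixes to backend objects, and the longest matching mount point wins. ToyVFS, MemoryFS, and ProcLikeFS are hypothetical classes for illustration only; the kernel's VFS works with inodes, dentries, and per-filesystem operation tables rather than Python objects.

```python
class MemoryFS:
    """Hypothetical backend that keeps file contents in a dict."""

    def __init__(self, files):
        self.files = files

    def read(self, relpath):
        return self.files[relpath]


class ProcLikeFS:
    """Hypothetical backend that generates contents on every read."""

    def read(self, relpath):
        return f"generated:{relpath}"


class ToyVFS:
    def __init__(self):
        self.mounts = {}  # mount point -> backend

    def mount(self, mount_point, fs):
        self.mounts[mount_point.rstrip("/") or "/"] = fs

    def resolve(self, path):
        """Pick the backend with the longest mount point that prefixes `path`."""
        best = None
        for mp in self.mounts:
            prefix = mp if mp == "/" else mp + "/"
            if path == mp or path.startswith(prefix):
                if best is None or len(mp) > len(best):
                    best = mp
        if best is None:
            raise FileNotFoundError(path)
        return self.mounts[best], path[len(best):].lstrip("/")

    def read(self, path):
        # Callers use one interface; the backend is chosen by mount point.
        fs, relpath = self.resolve(path)
        return fs.read(relpath)


vfs = ToyVFS()
vfs.mount("/", MemoryFS({"home/a.txt": "hello", "process/x": "not proc"}))
vfs.mount("/proc", ProcLikeFS())

print(vfs.read("/home/a.txt"))
print(vfs.read("/proc/uptime"))
```

The caller never names a backend, which mirrors how the same read call works on disk-backed and generated files alike.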

Key Operations and Their Cost

Operation  What Happens                                             Notes
---------  -------------------------------------------------------  ----------------------------------------------
open()     Walk path, load inode into inode cache                   Cached after first access
read()     Map offset → blocks via inode, read from page cache      Most reads hit cache
write()    Allocate blocks if needed, write to page cache, journal  Actual disk write is async
fsync()    Flush page cache + journal to disk                       Guarantees durability
unlink()   Remove dir entry, decrement link count                   File deleted when link_count=0 AND no open fds
rename()   Update directory entries (atomic on same fs)             Used by databases for atomic file replacement
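The fsync and rename rows combine into a common pattern for replacing a file atomically: write a temporary file in the same directory, fsync it, rename it over the target, then fsync the directory. The sketch below assumes a POSIX system (fsync on a directory descriptor is not available on Windows); atomic_write is a made-up helper name.

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace `path` with `data` (bytes) so readers see the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem for rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # push Python's buffer into the page cache
            os.fsync(tmp.fileno())  # push the page cache to the device
        os.replace(tmp_path, path)  # atomic swap of the directory entry
    except BaseException:
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # fsync the directory so the new directory entry itself is durable.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "config.json")
    atomic_write(target, b'{"version": 1}')
    atomic_write(target, b'{"version": 2}')
    with open(target, "rb") as f:
        final_contents = f.read()
    leftover = sorted(os.listdir(d))

print(final_contents, leftover)
```

Calling fsync before the rename matters: without it, a crash could leave the new name pointing at a file whose contents never reached the disk. Note that mkstemp creates the file with mode 0600, so a real implementation may need to copy the original permissions.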

Practical Commands

stat file.txt              # show inode details
ls -i                      # show inode numbers
df -h                      # disk usage per filesystem
du -sh dir/                # directory size
mount                      # list mounted filesystems
debugfs /dev/sda1          # inspect ext4 internals (opens read-only unless -w is given; usually needs root)
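Much of what stat and df report is also available programmatically. The following is a rough sketch using Python's standard library on a Unix-like system (os.statvfs is not available on Windows); the describe helper is a made-up name.

```python
import os
import shutil
import stat
import tempfile


def describe(path):
    """Return a few inode fields for `path`, similar to what `stat` prints."""
    st = os.stat(path)
    return {
        "inode": st.st_ino,
        "links": st.st_nlink,
        "size": st.st_size,
        "mode": stat.filemode(st.st_mode),  # e.g. "-rw-r--r--"
        "is_dir": stat.S_ISDIR(st.st_mode),
    }


with tempfile.TemporaryDirectory() as d:
    sample = os.path.join(d, "file.txt")
    with open(sample, "w") as f:
        f.write("x" * 8192)
    os.chmod(sample, 0o644)

    info = describe(sample)                 # like `stat file.txt`
    usage = shutil.disk_usage(d)            # like one row of `df`
    fs_block_size = os.statvfs(d).f_bsize   # filesystem block size in bytes

print(info)
print(usage.total, usage.used, usage.free, fs_block_size)
```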