File Systems
A file system organizes data on a storage device into named files and directories. It maps human-readable paths to blocks of data on disk, tracking metadata like permissions, timestamps, and ownership.
Why It Matters
Every program reads and writes files. Understanding how file systems work explains why fsync matters for databases, why deleting a 1GB file is nearly instant, why hard links exist, and how journaling prevents data corruption after a power failure.
Inodes and Directory Entries
An inode stores all metadata about a file except its name:
Inode 42:
type: regular file
size: 8192 bytes
mode: 0644 (rw-r--r--)
uid/gid: 1000/1000
timestamps: atime, mtime, ctime
link count: 1
data blocks: [100, 101] (or extents)
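Most of these fields are visible from user space through the stat syscall. The following is a minimal Python sketch, assuming a POSIX-like system; the helper name `describe_inode` and the throwaway file are illustrative, not part of any library. Block or extent lists are not exposed through this interface.

```python
import os
import stat
import tempfile

def describe_inode(path):
    """Return the inode fields that os.stat exposes for path."""
    st = os.stat(path)
    return {
        "inode": st.st_ino,
        "type": "regular file" if stat.S_ISREG(st.st_mode) else "other",
        "size": st.st_size,
        "mode": oct(stat.S_IMODE(st.st_mode)),  # permission bits only
        "uid/gid": (st.st_uid, st.st_gid),
        "link count": st.st_nlink,
        "mtime": st.st_mtime,
    }

# Demo on a throwaway 8192-byte file (hypothetical name)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "hello.txt")
    with open(demo_path, "wb") as f:
        f.write(b"x" * 8192)
    info = describe_inode(demo_path)
    for key, value in info.items():
        print(f"{key}: {value}")
```

Note that the file name does not appear anywhere in the stat result; it lives only in the directory entry.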
A directory is a file mapping names → inode numbers:
Directory inode 10:
"hello.txt" → inode 42
"notes.md" → inode 57
".." → inode 2
This is why renaming a file in the same directory is instant — it only updates the directory entry, not the data. Hard links create a second name pointing to the same inode.
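A short Python sketch of both behaviors, assuming a filesystem that supports hard links (most local Unix filesystems do). The file names are made up for the demo.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "hello.txt")
    alias = os.path.join(tmp, "alias.txt")
    renamed = os.path.join(tmp, "renamed.txt")

    with open(original, "w") as f:
        f.write("same data\n")
    inode_before = os.stat(original).st_ino

    # A hard link adds a second directory entry for the same inode
    os.link(original, alias)
    same_inode = os.stat(alias).st_ino == inode_before
    links_after_link = os.stat(original).st_nlink

    # Renaming rewrites the directory entry; the inode is untouched
    os.rename(original, renamed)
    inode_after_rename = os.stat(renamed).st_ino

    # Removing one name leaves the data reachable through the other
    os.unlink(renamed)
    links_after_unlink = os.stat(alias).st_nlink
    with open(alias) as f:
        surviving_data = f.read()
```

The data is only freed once the last name is removed and no process still holds the file open.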
ext4 Disk Layout
┌────────────┬──────────────┬──────────────┬─────────────┐
│ Superblock │ Block Group 0│ Block Group 1│ ...         │
│ (metadata) │              │              │             │
└────────────┴──────────────┴──────────────┴─────────────┘
Block Group:
┌──────────────┬──────────────┬─────────────┬─────────────┐
│ Bitmap       │ Bitmap       │ Inode Table │ Data Blocks │
│ (block alloc)│ (inode alloc)│             │             │
└──────────────┴──────────────┴─────────────┴─────────────┘
- Superblock: filesystem-wide metadata (block size, total blocks, free count)
- Block groups: divide disk into manageable sections with local allocation
- Extents: contiguous block ranges — ext4 stores “blocks 100-199” instead of 100 individual pointers
- Block size: typically 4KB (matches the usual 4KB page size on x86)
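Some of the filesystem-wide numbers kept in the superblock can be queried with the statvfs syscall. A minimal Python sketch, assuming a Unix system (`os.statvfs` is not available on Windows); the helper name `filesystem_summary` is illustrative.

```python
import os

def filesystem_summary(path="."):
    """Report block size and free counts for the filesystem holding path."""
    st = os.statvfs(path)
    return {
        "block_size": st.f_bsize,        # preferred I/O block size
        "fragment_size": st.f_frsize,    # unit that the block counts use
        "total_blocks": st.f_blocks,
        "free_blocks": st.f_bfree,
        "total_inodes": st.f_files,
        "free_inodes": st.f_ffree,
        "total_bytes": st.f_frsize * st.f_blocks,
    }

summary = filesystem_summary(".")
for key, value in summary.items():
    print(f"{key}: {value}")
```

The values depend entirely on the machine, so no particular output is shown here.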
Journaling
Problem: a crash mid-write can leave the filesystem inconsistent (allocated block but no inode pointing to it, or vice versa).
Solution: write-ahead logging. Changes go to a journal first, then to their actual locations.
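1. Write file data to its final location (in ext4's default data=ordered mode this happens before the related metadata is committed)
2. Write the metadata changes to the journal
3. Write a commit record that marks the journal transaction as complete
4. Later, checkpoint: copy the metadata to its final location and free the journal space
5. On mount after a crash: replay transactions that have a commit record, discard those that do not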
ext4 journals metadata by default. Full data journaling (data=journal) is safer but slower. Most databases (PostgreSQL, SQLite) implement their own journaling/WAL on top.
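The recovery rule of write-ahead logging can be sketched with a toy in-memory model. This is not how ext4 or its journal layer is implemented; the class `ToyJournaledStore` and its methods are hypothetical and exist only to show that committed transactions are replayed after a crash while uncommitted ones are dropped.

```python
class ToyJournaledStore:
    """In-memory stand-in for a disk with a metadata journal (illustrative only)."""

    def __init__(self):
        self.disk = {}      # final locations: key -> value
        self.journal = []   # logged transactions, oldest first

    def begin(self, updates):
        """Log the intended changes; self.disk is not touched yet."""
        txn = {"updates": dict(updates), "committed": False}
        self.journal.append(txn)
        return txn

    def commit(self, txn):
        """Write the commit record; from here on the change survives a crash."""
        txn["committed"] = True

    def checkpoint(self):
        """Apply committed transactions to their final locations, then drop them."""
        for txn in self.journal:
            if txn["committed"]:
                self.disk.update(txn["updates"])
        self.journal = [t for t in self.journal if not t["committed"]]

    def recover(self):
        """Mount-time recovery: replay committed entries, discard the rest."""
        replayed = 0
        for txn in self.journal:
            if txn["committed"]:
                self.disk.update(txn["updates"])
                replayed += 1
        self.journal = []
        return replayed


store = ToyJournaledStore()
t1 = store.begin({"inode 42 size": 8192, "block 100": "allocated"})
store.commit(t1)                            # committed, not yet checkpointed
t2 = store.begin({"inode 57 size": 4096})   # "crash" happens before commit
replayed = store.recover()
print(replayed, store.disk)
```

After recovery the store contains both updates from the first transaction and nothing from the second, so the metadata is never left half-applied.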
VFS (Virtual Filesystem)
Linux supports many filesystems through a common abstraction layer:
User: open("file.txt")
↓
Kernel: VFS (virtual filesystem switch)
↓ dispatch based on mount point
ext4 / xfs / btrfs / nfs / tmpfs / procfs
↓
Block device / network / memory
Everything uses the same syscall interface. open/read/write/close work identically whether the file is on ext4, NFS, or /proc.
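As a small illustration, the Python sketch below reads a regular temporary file and, where it exists, a procfs file through the same thin wrappers around open/read/close. The helper name `read_via_syscalls` is made up for the example, and the `/proc/self/status` part assumes Linux; it is skipped on other systems.

```python
import os
import tempfile

def read_via_syscalls(path, limit=4096):
    """Read up to limit bytes using the raw open/read/close wrappers."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.read(fd, limit)
    finally:
        os.close(fd)

# A regular file on whatever filesystem backs the temp directory
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"on-disk file\n")
    disk_path = tmp.name

results = {"disk": read_via_syscalls(disk_path)}
os.unlink(disk_path)

# A file generated on demand by the kernel (Linux only)
if os.path.exists("/proc/self/status"):
    results["procfs"] = read_via_syscalls("/proc/self/status")

for source, data in results.items():
    print(source, len(data), "bytes")
```

The calling code does not change between the two cases; only the filesystem behind the path differs.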
Key Operations and Their Cost
| Operation | What Happens | Notes |
|---|---|---|
| open() | Walk path, load inode into inode cache | Cached after first access |
| read() | Map offset → blocks via inode, read from page cache | Most reads hit cache |
| write() | Allocate blocks if needed, write to page cache, journal | Actual disk write is async |
| fsync() | Flush the file's dirty pages and metadata to disk | Data is durable once it returns successfully |
| unlink() | Remove dir entry, decrement link count | File deleted when link_count=0 AND no open fds |
| rename() | Update directory entries (atomic on same fs) | Used by databases for atomic file replacement |
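The fsync and rename rows combine into a common pattern for replacing a file atomically: write a temporary file in the same directory, fsync it, rename it over the target, then fsync the directory. A minimal Python sketch follows, assuming a POSIX system (fsync on a directory descriptor does not work on Windows); `atomic_write` is a hypothetical helper name, not a standard library function.

```python
import os
import tempfile

def atomic_write(path, data):
    """Replace path with data so readers see the old or the new file, never a mix."""
    directory = os.path.dirname(os.path.abspath(path))
    # Temp file must be on the same filesystem for rename to be atomic
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())      # data reaches disk before the rename
        os.replace(tmp_path, path)    # atomic swap of the directory entry
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # fsync the directory so the rename itself is durable
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "config.json")
    atomic_write(target, b'{"version": 1}')
    atomic_write(target, b'{"version": 2}')
    with open(target, "rb") as f:
        final_contents = f.read()
    leftover_files = os.listdir(tmp)
```

Skipping the first fsync risks a crash leaving the new name pointing at an empty or partial file, which is the case the pattern is meant to rule out.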
Practical Commands
stat file.txt # show inode details
ls -i # show inode numbers
df -h # disk usage per filesystem
du -sh dir/ # directory size
mount # list mounted filesystems
debugfs /dev/sda1 # inspect ext4 internals (opens read-only by default)
Related
- File IO in C — the syscall interface to files
- Signals and IPC — named pipes (FIFOs) are filesystem objects
- Memory Management — page cache sits between VFS and disk