Scheduling

The OS scheduler decides which process or thread runs on which CPU core, and for how long. It must balance throughput, latency, fairness, and responsiveness — often conflicting goals.

Why It Matters

Scheduling determines whether your interactive app feels snappy or laggy, whether a batch job finishes fast, and whether real-time deadlines are met. Understanding schedulers helps you reason about tail latency, thread pool sizing, and CPU affinity.

Scheduling Algorithms

AlgorithmHow It WorksPreemptive?Tradeoff
FIFO/FCFSRun to completion in arrival orderNoSimple, but one CPU-hog blocks everyone
SJFShortest job firstNoOptimal avg wait, but must predict job length
Round RobinFixed time slice, rotate queueYesFair, but context switch overhead
PriorityHighest priority runs firstYesResponsive, but can starve low priority
MLFQMultiple RR queues, demote CPU-hogsYesGood general purpose (BSD, Windows)
CFSVirtual runtime tree, pick smallestYesFair, proportional share (Linux default)

Round Robin Detail

Time slice = 10ms

Process A: ████──────   (runs 10ms, preempted)
Process B: ────████──   (runs 10ms, preempted)
Process C: ────────██   (runs 5ms, done)
Process A: ██████────   (continues)

Shorter time slices = better responsiveness, more context switch overhead. Typical values: 1-10ms.

MLFQ (Multi-Level Feedback Queue)

Queue 0 (highest priority, 1ms slice)  ← new tasks start here
Queue 1 (medium priority, 4ms slice)   ← demoted if uses full slice
Queue 2 (lowest priority, 16ms slice)  ← CPU-bound tasks land here

Rules:

  1. New tasks enter highest queue
  2. If a task uses its full time slice, demote it
  3. If a task blocks before its slice expires, keep it in the same queue
  4. Periodically boost all tasks to highest queue (prevent starvation)

Interactive tasks (short CPU bursts, lots of IO) stay in high queues. CPU-bound tasks sink to low queues.

Context Switch Cost

A context switch saves/restores: registers, program counter, stack pointer, and flushes TLB entries. Direct cost is ~1-10μs, but the indirect cost is much higher:

  • TLB flush: virtual→physical cache invalidated, causes TLB misses
  • Cache pollution: new process fills L1/L2 cache, evicting the old process’s data
  • Pipeline flush: CPU speculative execution state is lost

Rule of thumb: minimize context switches for throughput-sensitive workloads. Use thread pinning (taskset, pthread_setaffinity_np) for latency-sensitive code.

Linux CFS (Completely Fair Scheduler)

Linux’s default scheduler since 2.6.23. Each task tracks vruntime — how much CPU time it has received, weighted by priority.

vruntime += actual_runtime × (default_weight / task_weight)
  • Tasks are stored in a red-black tree sorted by vruntime
  • Scheduler always picks the leftmost node (smallest vruntime) → O(log n)
  • Higher nice value → higher weight divisor → vruntime grows faster → gets less CPU
nice -n 10 ./background_job      # lower priority (nice 10)
nice -n -5 ./latency_sensitive   # higher priority (needs root for negative nice)
renice -n 5 -p PID               # change running process priority

Priority Inversion

A high-priority task blocks on a lock held by a low-priority task, while a medium-priority task runs freely. The high-priority task is effectively running at low priority.

High (blocked on lock) ──────────────── BLOCKED
Medium (running)        ████████████████ RUNNING
Low (holding lock)      ──────────────── can't run to release lock!

Solutions: priority inheritance (temporarily boost lock-holder to waiter’s priority) or priority ceiling (lock has a fixed ceiling priority).

Real-Time Scheduling

Linux supports POSIX real-time classes above CFS:

ClassPolicyUse Case
SCHED_FIFORun until blocked or preempted by higher priorityHard real-time
SCHED_RRFIFO + time slice rotation within same prioritySoft real-time
SCHED_DEADLINEEarliest deadline firstPeriodic real-time tasks
SCHED_OTHERCFS (normal)Everything else
chrt -f 50 ./realtime_app       # SCHED_FIFO at priority 50
chrt -r 30 ./soft_realtime      # SCHED_RR at priority 30