Containers and Namespaces
Containers are Linux processes with isolated views of the system, achieved through two kernel features: namespaces (what a process can see) and cgroups (what a process can use). No hypervisor, no guest kernel — just restricted processes.
Why It Matters
Docker, Kubernetes, and most modern container platforms are built on namespaces and cgroups. Understanding the underlying mechanism demystifies containers: they’re not VMs, they share the host kernel, and their isolation is only as strong as what the kernel enforces.
Namespaces (Isolation)
Each namespace gives a process its own isolated view of a system resource:
| Namespace | Flag | What It Isolates |
|---|---|---|
| PID | CLONE_NEWPID | Process IDs — container sees its own PID 1 |
| NET | CLONE_NEWNET | Network interfaces, routing, iptables |
| MNT | CLONE_NEWNS | Mount points (own filesystem view) |
| UTS | CLONE_NEWUTS | Hostname and domain name |
| IPC | CLONE_NEWIPC | System V IPC (shared memory, semaphores, message queues) and POSIX message queues |
| USER | CLONE_NEWUSER | UID/GID mapping (root inside, unprivileged outside) |
| CGROUP | CLONE_NEWCGROUP | Cgroup root view |
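The kernel exposes each process's namespace membership as symlinks under /proc/PID/ns/, with targets of the form type:[inode]. Two processes are in the same namespace when those targets match. A minimal sketch, assuming a Linux /proc and using `ns_id` and `same_ns` as hypothetical helper names:

```shell
#!/bin/sh
# ns_id PID TYPE: print the namespace identifier, e.g. "uts:[<inode>]".
# (Hypothetical helper name; it reads the symlink the kernel exposes in /proc.)
ns_id() {
    readlink "/proc/$1/ns/$2"
}

# same_ns PID1 PID2 TYPE: succeed if both processes share that namespace.
same_ns() {
    a=$(ns_id "$1" "$3") || return 2
    b=$(ns_id "$2" "$3") || return 2
    [ "$a" = "$b" ]
}

# Print this shell's namespaces, one per row of the table above.
for t in pid net mnt uts ipc user cgroup; do
    printf '%-7s %s\n' "$t" "$(ns_id $$ "$t" 2>/dev/null || echo unavailable)"
done

# A process always shares every namespace with itself.
same_ns $$ $$ pid && echo "same pid namespace"
```

Comparing a container's PID against PID 1 of the host with `same_ns` is a quick way to check which namespaces the container actually has of its own. Reading another user's /proc/PID/ns entries generally needs root.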
Creating a Namespace
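#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>     // SIGCHLD
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>   // waitpid
#include <unistd.h>
static char child_stack[65536] __attribute__((aligned(16)));
static int child_fn(void *arg) {
    (void)arg;
    // Inside new PID, UTS, and mount namespaces
    sethostname("container", 9);
    printf("PID inside: %d\n", getpid()); // prints 1
    fflush(stdout);                       // flush before exec replaces the process image
    execl("/bin/sh", "sh", (char *)NULL);
    perror("execl");                      // reached only if exec fails
    return 1;
}
int main(void) {
    // Needs root (CAP_SYS_ADMIN) unless combined with CLONE_NEWUSER
    int flags = CLONE_NEWPID | CLONE_NEWUTS | CLONE_NEWNS | SIGCHLD;
    // The stack grows downward on most architectures, so pass the top of the buffer
    pid_t pid = clone(child_fn, child_stack + sizeof(child_stack), flags, NULL);
    if (pid == -1) {
        perror("clone");
        return EXIT_FAILURE;
    }
    waitpid(pid, NULL, 0);
    return EXIT_SUCCESS;
}
Inspecting Namespaces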
ls -la /proc/self/ns/ # list your namespaces
lsns # list all namespaces on the system
nsenter -t PID --pid --net --mount # enter another process's namespaces
unshare --pid --fork --mount-proc /bin/bash # start a shell in new PID and mount namespaces
cgroups (Resource Limits)
Control groups limit and account for CPU, memory, IO, and PIDs per process group.
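Which cgroup a process belongs to can be read from /proc/PID/cgroup; on cgroups v2 the relevant line has the form 0::/path. A small sketch that extracts that path, with `v2_cgroup_path` as a hypothetical helper name and a made-up sample path:

```shell
#!/bin/sh
# v2_cgroup_path: read /proc/PID/cgroup content on stdin and print the
# cgroups v2 path (the line starting with "0::"). Hypothetical helper name.
v2_cgroup_path() {
    sed -n 's/^0:://p'
}

# Example with sample content; the path is made up for illustration.
printf '0::/user.slice/demo.scope\n' | v2_cgroup_path   # /user.slice/demo.scope

# On a real system: the cgroup of the current shell, relative to the
# cgroup mount point (usually /sys/fs/cgroup).
if [ -r /proc/self/cgroup ]; then
    v2_cgroup_path < /proc/self/cgroup
fi
```

On a host that still uses cgroups v1 only, the file has no 0:: line and the helper prints nothing.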
cgroups v2
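# Enable the controllers for child cgroups (as root; often already enabled by systemd)
echo "+cpu +memory +pids" > /sys/fs/cgroup/cgroup.subtree_control
# Create a cgroup
mkdir /sys/fs/cgroup/mycontainer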
# Limit to 512MB RAM
echo 536870912 > /sys/fs/cgroup/mycontainer/memory.max
# Limit to 50% of one CPU
echo "50000 100000" > /sys/fs/cgroup/mycontainer/cpu.max
# Limit to 100 processes
echo 100 > /sys/fs/cgroup/mycontainer/pids.max
# Add a process
echo $PID > /sys/fs/cgroup/mycontainer/cgroup.procs
# Monitor usage
cat /sys/fs/cgroup/mycontainer/memory.current
cat /sys/fs/cgroup/mycontainer/cpu.stat
| File | What It Limits | Purpose |
|---|---|---|
| memory.max | Memory limit (OOM kill if exceeded) | Prevent runaway memory |
| cpu.max | CPU bandwidth (quota and period in µs) | Throttle CPU usage |
| pids.max | Max process count | Prevent fork bombs |
| io.max | Disk IO bandwidth and IOPS | Prevent IO starvation |
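The values written to these files are plain arithmetic: memory.max is a byte count, and cpu.max holds a quota and a period in microseconds, so the CPU allowance is quota divided by period. A minimal sketch of those conversions, using the hypothetical helper names `mib_to_bytes` and `cpu_max_to_cpus`:

```shell
#!/bin/sh
# mib_to_bytes N: convert mebibytes to the byte count memory.max expects.
mib_to_bytes() {
    echo $(( $1 * 1024 * 1024 ))
}

# cpu_max_to_cpus "QUOTA PERIOD": interpret a cpu.max value as a number of
# CPUs; a quota of "max" means no limit.
cpu_max_to_cpus() {
    set -- $1                      # split "QUOTA PERIOD" into $1 and $2
    if [ "$1" = "max" ]; then
        echo unlimited
    else
        LC_ALL=C awk -v q="$1" -v p="$2" 'BEGIN { printf "%.2f\n", q / p }'
    fi
}

mib_to_bytes 512                   # 536870912
cpu_max_to_cpus "50000 100000"     # 0.50
cpu_max_to_cpus "max 100000"       # unlimited
```

A quota larger than the period is valid and means more than one CPU, for example "200000 100000" allows two full CPUs.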
How Docker Uses These
docker run nginx =
1. Pull image → extract rootfs (layered filesystem: overlayfs)
2. clone() with CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC
3. pivot_root → switch to container's rootfs
4. Create veth pair → bridge network
5. Set cgroup limits (--memory, --cpus)
6. Drop capabilities (no CAP_SYS_ADMIN etc.)
7. exec() the entrypoint (nginx)
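Steps 2, 5, and 7 can be approximated from a shell with util-linux `unshare` and the cgroup files shown earlier. The sketch below only prints the commands rather than running them, since they need root. `container_plan` is a hypothetical helper, not something Docker provides, and it leaves out the rootfs, networking, and capability steps entirely.

```shell
#!/bin/sh
# container_plan NAME MEM_BYTES CMD...: print the commands that would put
# the current shell into a memory-limited cgroup and then replace it with
# CMD running in fresh namespaces. Nothing is executed.
container_plan() {
    name=$1
    mem=$2
    shift 2
    cg=/sys/fs/cgroup/$name
    echo "mkdir -p $cg"
    echo "echo $mem > $cg/memory.max"
    # Joining the cgroup first means the new process inherits the limit.
    echo "echo \$\$ > $cg/cgroup.procs"
    # --mount-proc remounts /proc so tools like ps see the new PID namespace.
    echo "exec unshare --pid --net --mount --uts --ipc --fork --mount-proc $*"
}

container_plan demo 536870912 /bin/sh
```

Printing a plan keeps the example safe to run as an unprivileged user; piping the output to a root shell would actually apply it, which is worth doing only on a scratch machine.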
Container vs VM
| Aspect | Container | VM |
|---|---|---|
| Isolation | Namespace (process-level) | Hardware (hypervisor) |
| Kernel | Shared with host | Own guest kernel |
| Startup | Milliseconds | Seconds to minutes |
| Overhead | Nearly zero | RAM for guest OS |
| Security | Weaker (shared kernel attack surface) | Stronger (hardware boundary) |
| Density | 100s per host | 10s per host |
For stronger isolation with container speed: gVisor (user-space kernel) or Kata Containers (lightweight VMs with container UX).
Related
- Processes and Threads — containers are processes with restricted views
- Memory Management — cgroups limit memory, OOM killer enforces
- File Systems — overlayfs layers for container images
- Signals and IPC — IPC namespace isolates shared memory/semaphores