Containers and Namespaces

Containers are Linux processes with isolated views of the system, achieved through two kernel features: namespaces (what a process can see) and cgroups (what a process can use). No hypervisor, no guest kernel — just restricted processes.

Why It Matters

Docker, Kubernetes, and all modern deployment platforms are built on namespaces and cgroups. Understanding the underlying mechanism demystifies containers: they're not VMs, they share the host kernel, and their isolation is only as strong as the kernel enforces.

Namespaces (Isolation)

Each namespace gives a process its own isolated view of a system resource:

| Namespace | Flag | What It Isolates |
| --- | --- | --- |
| PID | CLONE_NEWPID | Process IDs — container sees its own PID 1 |
| NET | CLONE_NEWNET | Network interfaces, routing, iptables |
| MNT | CLONE_NEWNS | Mount points — own filesystem view |
| UTS | CLONE_NEWUTS | Hostname and domain name |
| IPC | CLONE_NEWIPC | POSIX message queues, shared memory |
| USER | CLONE_NEWUSER | UID/GID mapping (root inside, unprivileged outside) |
| CGROUP | CLONE_NEWCGROUP | Cgroup root view |

Note the mount namespace flag is CLONE_NEWNS, not CLONE_NEWMNT; it was named before any other namespace existed.

Creating a Namespace

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

static int child_fn(void *arg) {
    // Runs inside the new PID, UTS, and mount namespaces
    sethostname("container", strlen("container"));  // only visible in this UTS ns
    printf("PID inside: %d\n", getpid());           // prints 1
    execl("/bin/sh", "sh", (char *)NULL);
    perror("execl");                                // reached only if exec fails
    return 1;
}

int main(void) {
    static char stack[65536];  // child stack; pass the top, it grows downward
    int flags = CLONE_NEWPID | CLONE_NEWUTS | CLONE_NEWNS | SIGCHLD;
    // Creating these namespaces requires root (or CAP_SYS_ADMIN)
    pid_t pid = clone(child_fn, stack + sizeof(stack), flags, NULL);
    if (pid == -1) {
        perror("clone");
        return 1;
    }
    waitpid(pid, NULL, 0);
    return 0;
}

Inspecting Namespaces

ls -la /proc/self/ns/             # list your namespaces
lsns                              # list all namespaces on the system
nsenter -t PID --pid --net --mnt  # enter another process's namespaces
unshare --pid --fork --mount-proc /bin/bash  # create new namespace from shell

cgroups (Resource Limits)

Control groups limit and account for CPU, memory, IO, and PIDs per process group.

cgroups v2

# Create a cgroup
mkdir /sys/fs/cgroup/mycontainer
 
# Limit to 512MB RAM
echo 536870912 > /sys/fs/cgroup/mycontainer/memory.max
 
# Limit to 50% of one CPU
echo "50000 100000" > /sys/fs/cgroup/mycontainer/cpu.max
 
# Limit to 100 processes
echo 100 > /sys/fs/cgroup/mycontainer/pids.max
 
# Add a process
echo $PID > /sys/fs/cgroup/mycontainer/cgroup.procs
 
# Monitor usage
cat /sys/fs/cgroup/mycontainer/memory.current
cat /sys/fs/cgroup/mycontainer/cpu.stat

| File | What It Sets | Purpose |
| --- | --- | --- |
| memory.max | Memory limit in bytes (OOM kill if exceeded) | Prevent runaway memory |
| cpu.max | CPU bandwidth as "quota period" in µs | Throttle CPU usage |
| pids.max | Max process count | Prevent fork bombs |
| io.max | Disk IO bandwidth | Prevent IO starvation |

How Docker Uses These

docker run nginx roughly amounts to:

1. Pull image → extract rootfs (layered filesystem: overlayfs)
2. clone() with CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC
3. pivot_root → switch to container's rootfs
4. Create veth pair → bridge network
5. Set cgroup limits (--memory, --cpus)
6. Drop capabilities (no CAP_SYS_ADMIN etc.)
7. exec() the entrypoint (nginx)

Container vs VM

| Aspect | Container | VM |
| --- | --- | --- |
| Isolation | Namespace (process-level) | Hardware (hypervisor) |
| Kernel | Shared with host | Own guest kernel |
| Startup | Milliseconds | Seconds to minutes |
| Overhead | Nearly zero | RAM for guest OS |
| Security | Weaker (shared kernel attack surface) | Stronger (hardware boundary) |
| Density | 100s per host | 10s per host |

For stronger isolation with container speed: gVisor (user-space kernel) or Kata Containers (lightweight VMs with container UX).