How Docker Works Internally
Published on 20.07.2025
Docker is such a cool project, mainly because it took existing, fragmented features of Linux and made them work together.
Docker's Isolation
Let's start with the core part: isolation. Linux has something called namespaces, a way to give a process its own isolated view of a system resource. One way to create a new namespace for the calling process is the unshare syscall.
#include <sched.h>
int unshare(int flags);
There are a few flags you can use to create a new namespace:
CLONE_NEWNS (mount points), CLONE_NEWUTS (hostname and domain name), CLONE_NEWIPC (System V IPC and POSIX message queues), CLONE_NEWUSER (user and group IDs), CLONE_NEWPID (process IDs), CLONE_NEWNET (network stack), CLONE_NEWCGROUP (cgroup root directory)
And here's a simple example of how to create a new namespace:
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
// Create new PID + mount namespaces for the calling process
if (unshare(CLONE_NEWPID | CLONE_NEWNS) == -1)
    perror("unshare");
With this, a process can be isolated along several dimensions such as PIDs, mounts, IPC, networking, and so on.
There are also security contexts and capabilities that help with isolation. For example, one can drop capabilities to limit the privileges of a process.
#include <sys/capability.h> // libcap, link with -lcap
// Drop all capabilities except CAP_NET_BIND_SERVICE
cap_t caps = cap_get_proc();
cap_clear(caps);
cap_value_t cap_list[1] = {CAP_NET_BIND_SERVICE};
// The effective set must be a subset of the permitted set
cap_set_flag(caps, CAP_PERMITTED, 1, cap_list, CAP_SET);
cap_set_flag(caps, CAP_EFFECTIVE, 1, cap_list, CAP_SET);
cap_set_proc(caps);
cap_free(caps);
Some common capabilities that Docker drops are CAP_SYS_ADMIN, CAP_NET_ADMIN and CAP_SYS_PTRACE. There are also seccomp profiles that can be used to limit the system calls a process can make.
#include <seccomp.h> // libseccomp, link with -lseccomp
// Create filter context: kill the process on any syscall not allowed below
scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL);
// Allow specific syscalls
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit), 0);
// Load filter into the kernel and free the userspace context
seccomp_load(ctx);
seccomp_release(ctx);
Docker blocks a lot of syscalls by default, like reboot and ptrace. Additionally, there is AppArmor/SELinux integration that can be used to limit access to the filesystem.
Resource Management
Docker uses cgroups to manage resources. Cgroups are a kernel mechanism for limiting and accounting the resources a group of processes can use. The cgroup hierarchy is exposed as a filesystem (mounted at /sys/fs/cgroup), and one can enforce limits by writing directly to the control files.
#include <fcntl.h>
#include <unistd.h>

// For example, to set a memory limit of 512MB (cgroup v1;
// on cgroup v2 the equivalent file is memory.max):
int fd = open("/sys/fs/cgroup/memory/mygroup/memory.limit_in_bytes", O_WRONLY);
write(fd, "536870912", 9); // 512 * 1024 * 1024 bytes
close(fd);
/*
 * /sys/fs/cgroup/memory/memory.limit_in_bytes
 * /sys/fs/cgroup/cpu/cpu.cfs_quota_us
 * /sys/fs/cgroup/cpu/cpu.cfs_period_us
 * /sys/fs/cgroup/cpuacct/cpuacct.usage
 * many more... */
Container Image and Storage
Docker uses a union filesystem to mount the image layers. A union filesystem lets you overlay multiple directories into a single merged view. Docker's default storage driver, overlay2, is built on OverlayFS, a union filesystem that uses copy-on-write (COW).
Essentially, the read-only image layers are stacked on top of each other, and any changes a container makes are written to the topmost, writable layer. This makes storage efficient, since layers can be shared between images, and reads stay fast. Every layer is identified by a hash of its content, and the layers form a chain/hierarchy.
neo@matrix:~$ docker image inspect postgres:15 | jq '.[].RootFS'
{
"Type": "layers",
"Layers": [
"sha256:58d7b77...",
"sha256:4bf98da...",
"sha256:f669ec5...",
"sha256:d48d2d9...",
"sha256:6f86137...",
"sha256:4e633c9...",
"sha256:c00d96a...",
"..."
]
}
Networking
Docker uses virtual network interfaces to connect containers to the host network. By default, Docker creates a `docker0` bridge interface on the host:
neo@matrix:~$ ip link show docker0
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default
link/ether 12:cc:c4:3e:5e:15 brd ff:ff:ff:ff:ff:ff
It also creates a veth pair for each container: one end shows up as the `eth0` interface inside the container, and the other end is attached to the `docker0` bridge on the host.
For port mapping, Docker uses iptables to forward traffic to the container. So for example, the container gets an internal IP, then Docker creates an iptables DNAT rule to translate a host port to the container port, and return traffic gets automatically translated back via connection tracking.
# Something like this
# Forward host:8080 to container:80
iptables -t nat -A DOCKER -p tcp --dport 8080 -j DNAT --to-destination 172.17.0.2:80
# Allow forwarding
iptables -A DOCKER -d 172.17.0.2 -p tcp --dport 80 -j ACCEPT
Internally, the runc runtime is responsible for creating all these namespaces and setting up cgroups, mounts and networking.
Container Runtime Architecture
Docker uses a client-server architecture. The client is the `docker` CLI and the server is `dockerd`, a daemon responsible for creating and managing containers.
dockerd: Image management, network setup, volume management, API server.
containerd: Container lifecycle management, image storage, runtime orchestration.
containerd-shim: Process supervision, stdio handling, signal forwarding, exit status collection.
runc: Low-level container creation, namespaces/cgroups setup, process execution.
That's a quick rundown of how Docker works internally, but there's so much more to it, everything from the OCI specifications to the Docker registry. Hope this was useful. If you have any questions or feedback, feel free to reach out - @pwnfunction.