---
layout: page
title: "Readme"
category: "docker"
---
Two core concepts:
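- Namespaces: isolate what a group of processes can see (PIDs, mounts, hostname, network), keeping it separated from other groups
- Cgroups: control how much of each resource (CPU, memory, I/O, PIDs) a group of processes may use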
Namespaces
docker run traefik
pstree -spa 66560
systemd,1 --system --deserialize 18
  └─containerd-shim,66535 -namespace moby -id 0ac949292b659a21e0037c91c7149f6fea12235ae4c5840d8448714081973154 -address /run/containerd/containerd.sock
      └─traefik,66560 traefik
nsenter - run program in different namespaces
Filesystem (Mount Namespace) comparison
Similarly, we can compare:
- `sudo nsenter -t 66560 -u -- hostname` and `hostname` for Hostname (UTS Namespace)
- `sudo nsenter -t 66560 -n -- ip addr` and `ip addr` for Network (Net Namespace)
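Namespace membership can also be compared without entering the namespace. Each process exposes its namespaces as symlinks under /proc/<pid>/ns, and two processes share a namespace when the link targets carry the same inode number. The sketch below assumes a Linux host; the helper names are illustrative and not part of Docker or nsenter.

```python
import os
import re

def parse_ns_link(target):
    """Split a link target such as 'uts:[4026531838]' into (type, inode)."""
    match = re.fullmatch(r"(\w+):\[(\d+)\]", target)
    if match is None:
        raise ValueError(f"unexpected namespace link: {target!r}")
    return match.group(1), int(match.group(2))

def namespaces_of(pid):
    """Return {entry_name: inode} read from /proc/<pid>/ns (pid may be 'self')."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    for entry in os.listdir(ns_dir):
        _, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, entry)))
        result[entry] = inode
    return result

def shared_namespaces(ns_a, ns_b):
    """Sorted names of the namespaces that have the same inode in both mappings."""
    return sorted(name for name, inode in ns_a.items() if ns_b.get(name) == inode)
```

Run as root, `shared_namespaces(namespaces_of("self"), namespaces_of(66560))` would list only the namespaces the host shell and the container process have in common. Reading another process's links usually needs root, which is why the call is not made inside the sketch itself.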
Implementation
#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/wait.h>
#include <stdio.h>
#include <string.h>
#include <sched.h>
#include <signal.h>
#include <unistd.h>

/* Stack for the cloned child, 1 MiB */
#define STACK_SIZE (1024 * 1024)
static char container_stack[STACK_SIZE];

char *const container_args[] = {
    "/bin/bash",
    NULL
};

int container_main(void *arg)
{
    /* Inside the new PID namespace the child sees itself as PID 1 */
    printf("Container [%5d] - inside the container!\n", getpid());
    /* Changes the hostname only in the new UTS namespace, not on the host */
    sethostname("container", strlen("container"));
    execv(container_args[0], container_args);
    /* execv only returns if it failed */
    perror("execv");
    return 1;
}

int main()
{
    printf("Parent [%5d] - start a container!\n", getpid());
    /* CLONE_NEWUTS: new hostname namespace, CLONE_NEWPID: new PID namespace.
       The stack grows downwards, so pass the top of the buffer. */
    int container_pid = clone(container_main, container_stack + STACK_SIZE,
                              CLONE_NEWUTS | CLONE_NEWPID | SIGCHLD, NULL);
    if (container_pid == -1) {
        /* Creating these namespaces requires root (CAP_SYS_ADMIN) */
        perror("clone");
        return 1;
    }
    waitpid(container_pid, NULL, 0);
    printf("Parent - container stopped!\n");
    return 0;
}
Output
hchen@ubuntu:~$ sudo ./pid
Parent [ 3474] - start a container!
Container [ 1] - inside the container!
root@container:~# echo $$
1
Ref: https://coolshell.cn/articles/17010.html
Cgroups
➜ ~ head -n 1 /proc/66560/cgroup
12:pids:/docker/0ac949292b659a21e0037c91c7149f6fea12235ae4c5840d8448714081973154
➜ ~ cat /sys/fs/cgroup/memory/docker/0ac949292b659a21e0037c91c7149f6fea12235ae4c5840d8448714081973154/memory.limit_in_bytes
9223372036854771712
This very large number is 2^63 - 1 rounded down to a multiple of the 4096-byte page size (that is, 2^63 - 4096). On most Linux systems using cgroup v1 it represents an “unlimited” or “no-limit” setting.
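The value printed above is close to, but not exactly, 2^63 - 1 (9223372036854775807). Assuming 4 KiB pages, it matches 2^63 - 1 rounded down to a whole number of pages. The sketch below checks that arithmetic; the function names and the default page size are assumptions for illustration.

```python
PAGE_SIZE = 4096       # assumption: 4 KiB pages, typical on x86_64
LONG_MAX = 2**63 - 1   # largest signed 64-bit integer

def unlimited_sentinel(page_size=PAGE_SIZE):
    """Largest multiple of page_size that fits in a signed 64-bit integer."""
    return (LONG_MAX // page_size) * page_size

def is_unlimited(limit_in_bytes, page_size=PAGE_SIZE):
    """True if a memory.limit_in_bytes value means 'no limit set'."""
    return limit_in_bytes >= unlimited_sentinel(page_size)

print(unlimited_sentinel())               # 9223372036854771712
print(is_unlimited(9223372036854771712))  # True
print(is_unlimited(512 * 1024 * 1024))    # False (a 512 MiB limit)
```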
Deep Dive into Docker Internals - Union Filesystem
https://martinheinz.dev/blog/44
Overlay filesystem
What is overlay fs https://wiki.archlinux.org/title/Overlay_filesystem
Ubuntu example
How containers use this
If the container writes a file, nothing in the lower (read-only image) layers is modified; the change is stored in the container’s upper directory instead (copy-on-write).
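The lookup rule can be illustrated with a toy model: reads check the writable upper layer first and then the lower layers from top to bottom, writes always land in the upper layer, and a delete is recorded as a whiteout marker. This is a simplified sketch using dictionaries, not how the kernel implements overlayfs; the class and method names are made up for illustration.

```python
WHITEOUT = object()  # marker for a path deleted in the upper layer

class OverlayModel:
    """Toy overlay: one writable upper dict on top of read-only lower dicts."""

    def __init__(self, lower_layers):
        # lower_layers[0] is the topmost lower layer, like the first lowerdir entry
        self.lower_layers = lower_layers
        self.upper = {}

    def read(self, path):
        if path in self.upper:
            content = self.upper[path]
            if content is WHITEOUT:
                raise FileNotFoundError(path)
            return content
        for layer in self.lower_layers:
            if path in layer:
                return layer[path]
        raise FileNotFoundError(path)

    def write(self, path, content):
        # Writes always go to the upper layer; lower layers are never touched
        self.upper[path] = content

    def delete(self, path):
        self.read(path)              # raises if the path is not visible
        self.upper[path] = WHITEOUT  # hide it without modifying lower layers

image_layers = [{"/etc/hostname": "from-image"}, {"/bin/sh": "busybox"}]
fs = OverlayModel(image_layers)
fs.write("/etc/hostname", "from-container")
print(fs.read("/etc/hostname"))          # from-container
print(image_layers[0]["/etc/hostname"])  # from-image
```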
docker run -d traefik
c702369a8429445312f561631ef8871ed9b8c055551151e549190398fef936e6
Running mount -t overlay inside the container
docker exec -it c702369a8429 sh
/ # mount -t overlay
overlay on / type overlay (rw,relatime,lowerdir=/var/lib/docker/overlay2/l/CK3RK6RKLXTDLCVT7J6XUNJFYI:/var/lib/docker/overlay2/l/MJZW5RC5EQX5QV64ZQFI5YRA6V:/var/lib/docker/overlay2/l/XG3WJGGNM4CP67RWANTABIWBOL:/var/lib/docker/overlay2/l/X32XXQFB6ADFFO2FLDCVIV6J2K:/var/lib/docker/overlay2/l/T72XWGVHJ6FWJXBYGSBLRK6FPE,upperdir=/var/lib/docker/overlay2/79ded441a3bd88ad3721bf119dc626690444ce58c9ed378f5a1b923667abe413/diff,workdir=/var/lib/docker/overlay2/79ded441a3bd88ad3721bf119dc626690444ce58c9ed378f5a1b923667abe413/work)
docker inspect c702369a8429 | grep GraphDriver -A 8
"GraphDriver": {
"Data": {
"ID": "c702369a8429445312f561631ef8871ed9b8c055551151e549190398fef936e6",
"LowerDir": "/var/lib/docker/overlay2/79ded441a3bd88ad3721bf119dc626690444ce58c9ed378f5a1b923667abe413-init/diff:/var/lib/docker/overlay2/0107d134713b05fc02091a41f1da372a9c9a0b7442f0c6a9ec130ace13940fe8/diff:/var/lib/docker/overlay2/8e8803ebddca09cd58274141eed8e426ddb4d3b96273cdda29c61f17ca20513b/diff:/var/lib/docker/overlay2/6b075fb9786d41cae6451f6ccc4e7708133646b57f45460394508e63a0da822b/diff:/var/lib/docker/overlay2/8beff5c84e30b1915a9017f659232bacde302c7386b5a9b7e4196b3932492780/diff",
"MergedDir": "/var/lib/docker/overlay2/79ded441a3bd88ad3721bf119dc626690444ce58c9ed378f5a1b923667abe413/merged",
"UpperDir": "/var/lib/docker/overlay2/79ded441a3bd88ad3721bf119dc626690444ce58c9ed378f5a1b923667abe413/diff",
"WorkDir": "/var/lib/docker/overlay2/79ded441a3bd88ad3721bf119dc626690444ce58c9ed378f5a1b923667abe413/work"
},
"Name": "overlay2"
The same overlay mount is visible on the host when searching for the merged dir /var/lib/docker/overlay2/79ded441a3bd88ad3721bf119dc626690444ce58c9ed378f5a1b923667abe413/merged
➜ ~ mount | grep 79ded441a3bd88ad3721bf119dc6266904
overlay on /var/lib/docker/overlay2/79ded441a3bd88ad3721bf119dc626690444ce58c9ed378f5a1b923667abe413/merged type overlay (rw,relatime,lowerdir=/var/lib/docker/overlay2/l/CK3RK6RKLXTDLCVT7J6XUNJFYI:/var/lib/docker/overlay2/l/MJZW5RC5EQX5QV64ZQFI5YRA6V:/var/lib/docker/overlay2/l/XG3WJGGNM4CP67RWANTABIWBOL:/var/lib/docker/overlay2/l/X32XXQFB6ADFFO2FLDCVIV6J2K:/var/lib/docker/overlay2/l/T72XWGVHJ6FWJXBYGSBLRK6FPE,upperdir=/var/lib/docker/overlay2/79ded441a3bd88ad3721bf119dc626690444ce58c9ed378f5a1b923667abe413/diff,workdir=/var/lib/docker/overlay2/79ded441a3bd88ad3721bf119dc626690444ce58c9ed378f5a1b923667abe413/work)
➜ ~ sudo findmnt --target /var/lib/docker/overlay2/79ded441a3bd88ad3721bf119dc626690444ce58c9ed378f5a1b923667abe413/merged
TARGET SOURCE FSTYPE OPTIONS
/var/lib/docker/overlay2/79ded441a3bd88ad3721bf119dc626690444ce58c9ed378f5a1b923667abe413/merged overlay overlay rw,relatime,lowerdir=/var/lib/docker/overlay2/l/CK3RK6RKLXTDLCVT7J6XUNJFYI:/var/lib/docker/overlay2/l
The following mountinfo output gives a detailed map of the filesystem environment for the Docker container process 14188:
- The Root Filesystem (OverlayFS): 1121 ... / ... overlay overlay rw,lowerdir=...,upperdir=...,workdir=...
- Virtual Filesystems (/proc, /dev, /sys): 1122 proc, 1123 tmpfs, 1125 sysfs ...
- Cgroups Mounts: /sys/fs/cgroup/* (1127 ...)
- Container-Specific Configuration Files: /etc/resolv.conf, /etc/hostname, /etc/hosts
➜ ~ sudo cat /proc/14188/mountinfo
1121 994 0:80 / / rw,relatime - overlay overlay rw,lowerdir=/var/lib/docker/overlay2/l/CK3RK6RKLXTDLCVT7J6XUNJFYI:/var/lib/docker/overlay2/l/MJZW5RC5EQX5QV64ZQFI5YRA6V:/var/lib/docker/overlay2/l/XG3WJGGNM4CP67RWANTABIWBOL:/var/lib/docker/overlay2/l/X32XXQFB6ADFFO2FLDCVIV6J2K:/var/lib/docker/overlay2/l/T72XWGVHJ6FWJXBYGSBLRK6FPE,upperdir=/var/lib/docker/overlay2/79ded441a3bd88ad3721bf119dc626690444ce58c9ed378f5a1b923667abe413/diff,workdir=/var/lib/docker/overlay2/79ded441a3bd88ad3721bf119dc626690444ce58c9ed378f5a1b923667abe413/work
1122 1121 0:87 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
1123 1121 0:88 / /dev rw,nosuid - tmpfs tmpfs rw,size=65536k,mode=755
1124 1123 0:89 / /dev/pts rw,nosuid,noexec,relatime - devpts devpts rw,gid=5,mode=620,ptmxmode=666
1125 1121 0:90 / /sys ro,nosuid,nodev,noexec,relatime - sysfs sysfs ro
1126 1125 0:91 / /sys/fs/cgroup rw,nosuid,nodev,noexec,relatime - tmpfs tmpfs rw,mode=755
1127 1126 0:29 /docker/c702369a8429445312f561631ef8871ed9b8c055551151e549190398fef936e6 /sys/fs/cgroup/systemd ro,nosuid,nodev,noexec,relatime master:11 - cgroup cgroup rw,xattr,name=systemd
1128 1126 0:31 /docker/c702369a8429445312f561631ef8871ed9b8c055551151e549190398fef936e6 /sys/fs/cgroup/perf_event ro,nosuid,nodev,noexec,relatime master:14 - cgroup cgroup rw,perf_event
1129 1126 0:32 /docker/c702369a8429445312f561631ef8871ed9b8c055551151e549190398fef936e6 /sys/fs/cgroup/cpu,cpuacct ro,nosuid,nodev,noexec,relatime master:15 - cgroup cgroup rw,cpu,cpuacct
1130 1126 0:33 /docker/c702369a8429445312f561631ef8871ed9b8c055551151e549190398fef936e6 /sys/fs/cgroup/cpuset ro,nosuid,nodev,noexec,relatime master:16 - cgroup cgroup rw,cpuset
1131 1126 0:34 /docker/c702369a8429445312f561631ef8871ed9b8c055551151e549190398fef936e6 /sys/fs/cgroup/blkio ro,nosuid,nodev,noexec,relatime master:17 - cgroup cgroup rw,blkio
1132 1126 0:35 /docker/c702369a8429445312f561631ef8871ed9b8c055551151e549190398fef936e6 /sys/fs/cgroup/rdma ro,nosuid,nodev,noexec,relatime master:18 - cgroup cgroup rw,rdma
1133 1126 0:36 /docker/c702369a8429445312f561631ef8871ed9b8c055551151e549190398fef936e6 /sys/fs/cgroup/net_cls,net_prio ro,nosuid,nodev,noexec,relatime master:19 - cgroup cgroup rw,net_cls,net_prio
1134 1126 0:37 /docker/c702369a8429445312f561631ef8871ed9b8c055551151e549190398fef936e6 /sys/fs/cgroup/devices ro,nosuid,nodev,noexec,relatime master:20 - cgroup cgroup rw,devices
1135 1126 0:38 /docker/c702369a8429445312f561631ef8871ed9b8c055551151e549190398fef936e6 /sys/fs/cgroup/freezer ro,nosuid,nodev,noexec,relatime master:21 - cgroup cgroup rw,freezer
1136 1126 0:39 /docker/c702369a8429445312f561631ef8871ed9b8c055551151e549190398fef936e6 /sys/fs/cgroup/hugetlb ro,nosuid,nodev,noexec,relatime master:22 - cgroup cgroup rw,hugetlb
1137 1126 0:40 /docker/c702369a8429445312f561631ef8871ed9b8c055551151e549190398fef936e6 /sys/fs/cgroup/memory ro,nosuid,nodev,noexec,relatime master:23 - cgroup cgroup rw,memory
1138 1126 0:41 /docker/c702369a8429445312f561631ef8871ed9b8c055551151e549190398fef936e6 /sys/fs/cgroup/pids ro,nosuid,nodev,noexec,relatime master:24 - cgroup cgroup rw,pids
1139 1123 0:86 / /dev/mqueue rw,nosuid,nodev,noexec,relatime - mqueue mqueue rw
1140 1123 0:92 / /dev/shm rw,nosuid,nodev,noexec,relatime - tmpfs shm rw,size=65536k
1141 1121 8:1 /var/lib/docker/containers/c702369a8429445312f561631ef8871ed9b8c055551151e549190398fef936e6/resolv.conf /etc/resolv.conf rw,relatime - ext4 /dev/sda1 rw,errors=remount-ro,data=ordered
1142 1121 8:1 /var/lib/docker/containers/c702369a8429445312f561631ef8871ed9b8c055551151e549190398fef936e6/hostname /etc/hostname rw,relatime - ext4 /dev/sda1 rw,errors=remount-ro,data=ordered
1143 1121 8:1 /var/lib/docker/containers/c702369a8429445312f561631ef8871ed9b8c055551151e549190398fef936e6/hosts /etc/hosts rw,relatime - ext4 /dev/sda1 rw,errors=remount-ro,data=ordered
995 1122 0:87 /bus /proc/bus ro,nosuid,nodev,noexec,relatime - proc proc rw
996 1122 0:87 /fs /proc/fs ro,nosuid,nodev,noexec,relatime - proc proc rw
997 1122 0:87 /irq /proc/irq ro,nosuid,nodev,noexec,relatime - proc proc rw
998 1122 0:87 /sys /proc/sys ro,nosuid,nodev,noexec,relatime - proc proc rw
999 1122 0:87 /sysrq-trigger /proc/sysrq-trigger ro,nosuid,nodev,noexec,relatime - proc proc rw
1012 1122 0:93 / /proc/acpi ro,relatime - tmpfs tmpfs ro
1013 1122 0:88 /null /proc/interrupts rw,nosuid - tmpfs tmpfs rw,size=65536k,mode=755
1014 1122 0:88 /null /proc/kcore rw,nosuid - tmpfs tmpfs rw,size=65536k,mode=755
1015 1122 0:88 /null /proc/keys rw,nosuid - tmpfs tmpfs rw,size=65536k,mode=755
1016 1122 0:88 /null /proc/timer_list rw,nosuid - tmpfs tmpfs rw,size=65536k,mode=755
1017 1122 0:88 /null /proc/sched_debug rw,nosuid - tmpfs tmpfs rw,size=65536k,mode=755
1018 1122 0:94 / /proc/scsi ro,relatime - tmpfs tmpfs ro
1019 1125 0:95 / /sys/firmware ro,relatime - tmpfs tmpfs ro
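Each mountinfo line follows the layout documented in proc(5): mount ID, parent ID, major:minor, root, mount point, mount options, zero or more optional fields, a `-` separator, then filesystem type, source and superblock options. A minimal parser sketch for one line follows; the function name and dictionary keys are illustrative choices.

```python
def parse_mountinfo_line(line):
    """Parse one /proc/<pid>/mountinfo line into a dict."""
    # The " - " separator splits the per-mount fields from the filesystem fields
    left, right = line.rstrip("\n").split(" - ", 1)
    fields = left.split()
    fs_parts = right.split(" ", 2)
    return {
        "mount_id": int(fields[0]),
        "parent_id": int(fields[1]),
        "device": fields[2],
        "root": fields[3],
        "mount_point": fields[4],
        "options": fields[5].split(","),
        "optional": fields[6:],  # e.g. ["master:11"], may be empty
        "fstype": fs_parts[0],
        "source": fs_parts[1],
        "super_options": fs_parts[2].split(",") if len(fs_parts) > 2 else [],
    }

sample = "1122 1121 0:87 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw"
entry = parse_mountinfo_line(sample)
print(entry["mount_point"], entry["fstype"], entry["parent_id"])  # /proc proc 1121
```

Grouping the parsed entries by `parent_id` rebuilds the mount tree: in the output above, 1121 (the overlay root) is the parent of /proc, /dev, /sys and the three bind-mounted files under /etc.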
Changes to the image
➜ ~ docker diff c702369a8429
A /nishanth.txt
C /root
A /root/.ash_history
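A stands for Add.
- This indicates that a new file named nishanth.txt has been created in the container’s root (/) directory. This file did not exist in the original traefik image.
C stands for Change.
- /root is marked as changed because /root/.ash_history was added inside it.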
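docker diff prefixes each path with a one-letter change type: A (added), C (changed) or D (deleted). The small sketch below turns the output shown above into structured records; the function name is illustrative.

```python
CHANGE_KINDS = {"A": "added", "C": "changed", "D": "deleted"}

def parse_docker_diff(output):
    """Turn `docker diff` text into a list of (change_kind, path) tuples."""
    changes = []
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        kind, path = line.split(" ", 1)
        changes.append((CHANGE_KINDS.get(kind, "unknown"), path))
    return changes

diff_output = """A /nishanth.txt
C /root
A /root/.ash_history"""

for kind, path in parse_docker_diff(diff_output):
    print(f"{kind:8} {path}")
```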
docker history traefik
IMAGE CREATED CREATED BY SIZE COMMENT
a14917e96c7b 3 weeks ago LABEL org.opencontainers.image.vendor=Traefi… 0B buildkit.dockerfile.v0
<missing> 3 weeks ago CMD ["traefik"] 0B buildkit.dockerfile.v0
<missing> 3 weeks ago ENTRYPOINT ["/entrypoint.sh"] 0B buildkit.dockerfile.v0
<missing> 3 weeks ago EXPOSE map[80/tcp:{}] 0B buildkit.dockerfile.v0
<missing> 3 weeks ago COPY entrypoint.sh / # buildkit 419B buildkit.dockerfile.v0
<missing> 3 weeks ago RUN /bin/sh -c set -ex; apkArch="$(apk --pr… 168MB buildkit.dockerfile.v0
<missing> 3 weeks ago RUN /bin/sh -c apk --no-cache add ca-certifi… 1MB buildkit.dockerfile.v0
<missing> 3 weeks ago CMD ["/bin/sh"] 0B buildkit.dockerfile.v0
<missing> 3 weeks ago ADD alpine-minirootfs-3.22.2-x86_64.tar.gz /… 8.32MB buildkit.dockerfile.v0
The Dockerfile is at https://github.com/traefik/traefik-library-image/blob/master/v3.5/alpine/Dockerfile. Each instruction below corresponds to one entry in the history above, which lists the newest layer first:
FROM alpine:3.22
RUN apk --no-cache add ca-certificates tzdata
RUN set -ex; \
    apkArch="$(apk --print-arch)"; \
    case "$apkArch" in \
        armhf) arch='armv6' ;; \
        aarch64) arch='arm64' ;; \
        x86_64) arch='amd64' ;; \
        riscv64) arch='riscv64' ;; \
        s390x) arch='s390x' ;; \
        ppc64le) arch='ppc64le' ;; \
        *) echo >&2 "error: unsupported architecture: $apkArch"; exit 1 ;; \
    esac; \
    wget --quiet -O /tmp/traefik.tar.gz "https://github.com/traefik/traefik/releases/download/v3.5.3/traefik_v3.5.3_linux_$arch.tar.gz"; \
    tar xzvf /tmp/traefik.tar.gz -C /usr/local/bin traefik; \
    rm -f /tmp/traefik.tar.gz; \
    chmod +x /usr/local/bin/traefik
COPY entrypoint.sh /
EXPOSE 80
ENTRYPOINT ["/entrypoint.sh"]
CMD ["traefik"]
# Metadata
LABEL org.opencontainers.image.vendor="Traefik Labs" \
    org.opencontainers.image.url="https://traefik.io" \
    org.opencontainers.image.source="https://github.com/traefik/traefik" \
    org.opencontainers.image.title="Traefik" \
    org.opencontainers.image.description="A modern reverse-proxy" \
    org.opencontainers.image.version="v3.5.3" \
    org.opencontainers.image.documentation="https://docs.traefik.io"
Concepts
History
A Brief History of Containers (by Jeff Victor & Kir Kolyshkin)
- 2005: OpenVZ (Open Virtuozzo) is an operating system-level virtualization technology for Linux which uses a patched Linux kernel for virtualization, isolation, resource management and checkpointing. The code was not released as part of the official Linux kernel.
- Process Containers (launched by Google in 2006) was designed for limiting, accounting and isolating resource usage (CPU, memory, disk I/O, network) of a collection of processes. It was renamed “Control Groups (cgroups)” a year later and eventually merged into Linux kernel 2.6.24.
- LXC (LinuX Containers) was the first, most complete implementation of a Linux container manager. It was implemented in 2008 using cgroups and Linux namespaces, and it works on a single Linux kernel without requiring any patches.
- Docker also used LXC in its initial stages and later replaced that container manager with its own library, libcontainer.
Ref: https://www.aquasec.com/blog/a-brief-history-of-containers-from-1970s-chroot-to-docker-2016/
LXC and libcontainer
- Linux Containers (LXC) offers a userspace interface for the Linux kernel containment features and is specific to Linux. Docker used it as an execution driver before version 0.9. On March 13, 2014, with the release of version 0.9, Docker dropped LXC as the default execution environment and replaced it with its own libcontainer library.
- Docker 0.9 includes 2 major improvements: execution drivers and libcontainer.
- libcontainer is a pure Go library which Docker developed to access the kernel’s container APIs directly, without any other dependencies.
- Thanks to libcontainer, Docker out of the box can now manipulate namespaces, control groups, capabilities, apparmor profiles, network interfaces and firewalling rules – all in a consistent and predictable way, and without depending on LXC or any other userland package.
- libcontainer (now opencontainers/runc) is an abstraction designed to support a wider range of isolation technologies, as described in this article: https://jancorg.github.io/blog/2015/01/03/libcontainer-overview/
Ref: https://stackoverflow.com/questions/34152365/difference-between-lxc-and-libcontainer
- In 2016 the container space was booming and Docker decided to split the monolith into separate parts, some of which other projects can even build on. That is how containerd happened: https://blog.docker.com/2016/04/docker-containerd-integration/. That was Docker 1.11 (so pretty much ancient history).
- containerd is a daemon that acts as an API facade for various container runtimes and operating systems. When using containerd, you no longer work with syscalls; instead you work with higher-level entities like snapshots and containers, and the rest is abstracted away.
- If you want to understand containerd in more depth, there is design documentation in their GitHub repo https://github.com/containerd/containerd/tree/master/design. Under the hood, containerd uses runc to do all the Linux work.
Read more at https://stackoverflow.com/questions/41645665/how-containerd-compares-to-runc
containerd, runc, shim
- The Open Container Initiative (OCI) maintains the specifications for container runtimes and images. Current Docker versions support the OCI image and runtime specs.
- containerd is a container runtime which can manage a complete container lifecycle - from image transfer/storage to container execution, supervision and networking.
- containerd-shim handles headless containers: once runc has initialized a container, runc exits and hands the container over to containerd-shim, which acts as a middleman.
- runc is a lightweight, universal container runtime which abides by the OCI specification. containerd uses runc to spawn and run containers according to the OCI spec. It is also a repackaging of libcontainer.
- gRPC is used for communication between containerd and docker-engine.
runc
What happens under the hood when we create a new container on Linux?
- When the user fires a command from the CLI, it makes an API call to the Docker daemon, which then calls containerd via gRPC, which in turn calls the shim process and runc.
- containerd handles execution/lifecycle operations like start, stop, pause and unpause. The OCI (Open Container Initiative) runtime layer interfaces with the kernel.
- runc spins up the container and exits, but the shim remains connected to the container. This is also the case when multiple containers are spun up.
Ref: https://stackoverflow.com/questions/46649592/dockerd-vs-docker-containerd-vs-docker-runc-vs-docker-containerd-ctr-vs-docker-c
What happens when you run a container
terminal <-> docker <-> dockerd <-> containerd <-> shim <-> application (container)
Ref: https://labs.iximiuz.com/tutorials/docker-run-vs-attach-vs-exec
TODO
https://blog.quarkslab.com/digging-into-runtimes-runc.html
Ref: https://terenceli.github.io/%E6%8A%80%E6%9C%AF/2021/12/22/runc-internals-1
Ref: https://iximiuz.com/en/posts/journey-from-containerization-to-orchestration-and-beyond/