A container is just another process in Linux. Its file system is extracted from tarballs, the process is bound to namespaces, and its resources are limited by cgroups. Let us take a closer look.
Introduction
Containers, and Docker in particular, have gained popularity in recent years because they accelerate the application development life cycle. Developers can build the application and its dependencies into a docker image, which can then be easily used for testing, deployment and scaling. Compared to a virtual machine, a container is a lightweight sandbox which provides isolation for the application running inside. Under the hood, it is just an ordinary Linux process that makes use of Linux features such as rootfs, namespaces, and cgroups.
Isolation with Namespaces
We start a busybox container with an interactive shell, then launch two sleep commands in the background. Let us look at the processes with ps inside the container.
~$ docker run -it busybox /bin/sh
/ # ps
PID USER TIME COMMAND
1 root 0:00 /bin/sh
11 root 0:00 ps
/ # sleep 500 &
/ # sleep 400 &
/ # ps
PID USER TIME COMMAND
1 root 0:00 /bin/sh
12 root 0:00 sleep 500
13 root 0:00 sleep 400
14 root 0:00 ps
It looks like we have an isolated environment where PID 1 is the shell and the rest are its child processes. Is it really a separate environment, similar to what a virtual machine provides? What do these processes look like from the host?
First, we obtain the container's PID on the host (5777 in this instance) with the docker inspect command. Then we list that process and its children.
(base) qfen8290@qfen8290-Inspiron-5555:~$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
b1fd7de96a9e busybox "/bin/sh" 24 seconds ago Up 23 seconds gracious_gagarin
(base) qfen8290@qfen8290-Inspiron-5555:~$ docker inspect gracious_gagarin | jq '.[].State.Pid'
5777
(base) qfen8290@qfen8290-Inspiron-5555:~$ ps -eaf | grep 5777 | grep -v grep
root 5777 5743 0 17:20 pts/0 00:00:00 /bin/sh
root 5845 5777 0 17:20 pts/0 00:00:00 sleep 500
root 5847 5777 0 17:20 pts/0 00:00:00 sleep 400
From the host, we can see that the shell and the two sleep commands we ran in the container are ordinary processes like any other. Docker simply applies a Linux PID namespace to construct an environment that appears isolated from the inside.
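One way to see both views of the same process side by side is the NSpid field in /proc/<pid>/status, which kernels from Linux 4.1 onward use to list a process's PID in each nested PID namespace, outermost first. Below is a minimal Python sketch. The helper names parse_nspid and read_nspid are hypothetical, and the sample text is constructed from the PIDs shown above rather than captured from a real system.

```python
def parse_nspid(status_text):
    """Return the PIDs on the NSpid line of a /proc/<pid>/status dump.

    The list is ordered from the outermost PID namespace to the innermost.
    An empty list means the line is absent (e.g. a kernel older than 4.1).
    """
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(field) for field in line.split()[1:]]
    return []


def read_nspid(pid):
    """Read and parse NSpid for a live process (Linux only)."""
    with open(f"/proc/{pid}/status") as status_file:
        return parse_nspid(status_file.read())


# Illustrative sample only: host PID 5845 and container PID 12 are the
# values the transcripts above show for "sleep 500".
sample = "Name:\tsleep\nPid:\t5845\nPPid:\t5777\nNSpid:\t5845\t12\n"
host_pid, container_pid = parse_nspid(sample)
print(f"host sees PID {host_pid}, container sees PID {container_pid}")
```

On the host, calling read_nspid(5845) while the container is running would be the live equivalent, assuming the kernel exposes the field.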
The idea resembles namespaces (scopes) in Python, where names defined inside a function are visible only within that function. Similarly, when a process is started in Linux, it can be placed in its own set of namespaces. Type man namespaces for more information. The available namespaces include mount, UTS (hostname), IPC, network, PID, user and cgroup. The man page describes a namespace as follows:
A namespace wraps a global system resource in an abstraction that makes it appear to the processes within the namespace that they have their own isolated instance of the global resource.
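The kernel exposes the namespaces a process belongs to as symbolic links under /proc/<pid>/ns, each pointing at a target of the form type:[inode]. Two processes share a namespace when the inode numbers match. The following Python sketch reads those links; the function names are hypothetical, and the inode in the example comment is only illustrative.

```python
import os
import re

# Link targets look like "pid:[4026531836]" (the number is illustrative).
_NS_LINK = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(target):
    """Split a namespace link target into (namespace type, inode number)."""
    match = _NS_LINK.match(target)
    if match is None:
        raise ValueError(f"unexpected namespace link: {target!r}")
    return match.group(1), int(match.group(2))


def namespaces_of(pid="self"):
    """Map each entry under /proc/<pid>/ns to its namespace inode."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    for name in sorted(os.listdir(ns_dir)):
        _, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, name)))
        result[name] = inode
    return result


def shared_namespaces(pid_a, pid_b):
    """Names of the namespaces that two processes have in common."""
    ns_a, ns_b = namespaces_of(pid_a), namespaces_of(pid_b)
    return sorted(name for name in ns_a if ns_a[name] == ns_b.get(name))
```

Run as root on the host, shared_namespaces("self", 5777) would list only the namespaces that Docker did not replace for the container. Reading another user's /proc/<pid>/ns entries normally requires elevated privileges.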
With the nsenter command, we can enter the namespaces of the container we just started. Run on the host, ps now produces the same output as it does inside the container.
(base) qfen8290@qfen8290-Inspiron-5555:~$ sudo nsenter --target 5777 --mount --uts --ipc --net --pid ps
PID USER TIME COMMAND
1 root 0:00 /bin/sh
12 root 0:00 sleep 500
13 root 0:00 sleep 400
15 root 0:00 ps
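nsenter is built on the setns system call, which moves the calling process into the namespace referred to by an open file descriptor. The Python sketch below imitates the command above through ctypes. It is a simplified illustration under several assumptions: it must run as root, it uses os.waitstatus_to_exitcode (Python 3.9 or later), the names ns_path and run_in_namespaces are hypothetical, and it skips much of what the real tool handles.

```python
import ctypes
import os

# Namespace files under /proc/<pid>/ns, matching the flags used above
# (--uts --ipc --net --pid --mount). The mount namespace is joined last.
NAMESPACES = ("uts", "ipc", "net", "pid", "mnt")


def ns_path(pid, name):
    """Path of the file that represents one namespace of a process."""
    return f"/proc/{pid}/ns/{name}"


def run_in_namespaces(pid, argv, names=NAMESPACES):
    """Join the namespaces of `pid`, then run `argv` in a child process.

    Needs root. The calling process stays in the joined namespaces
    afterwards, so call this from a throwaway script.
    """
    libc = ctypes.CDLL(None, use_errno=True)
    # Open every namespace file first: once the mount namespace changes,
    # the host's /proc may no longer be reachable at the same path.
    fds = [os.open(ns_path(pid, name), os.O_RDONLY) for name in names]
    for fd in fds:
        if libc.setns(fd, 0) != 0:  # 0 = accept any namespace type
            err = ctypes.get_errno()
            raise OSError(err, os.strerror(err))
        os.close(fd)
    # A PID namespace joined with setns applies to children only, so fork.
    child = os.fork()
    if child == 0:
        try:
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)  # reached only if exec failed
    _, status = os.waitpid(child, 0)
    return os.waitstatus_to_exitcode(status)
```

Calling run_in_namespaces(5777, ["ps"]) as root is intended to behave like the nsenter invocation above, although this sketch has not been checked against every kernel and Docker configuration.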
Similarly, we can check the UTS (hostname) and network namespaces. The following is run inside the container.
/ # hostname
b1fd7de96a9e
/ # ifconfig
eth0 Link encap:Ethernet HWaddr 02:42:AC:11:00:02
inet addr:172.17.0.2 Bcast:172.17.255.255 Mask:255.255.0.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:43 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:6617 (6.4 KiB) TX bytes:0 (0.0 B)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
/ #
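These are the corresponding nsenter commands run on the host, with their output. The hostname, interfaces and addresses are identical; only the RX counters have moved on, since a little time passed between the two runs.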
(base) qfen8290@qfen8290-Inspiron-5555:~$ sudo nsenter --target 5777 --mount --uts --ipc --net --pid hostname
b1fd7de96a9e
(base) qfen8290@qfen8290-Inspiron-5555:~$ sudo nsenter --target 5777 --mount --uts --ipc --net --pid ifconfig
eth0 Link encap:Ethernet HWaddr 02:42:AC:11:00:02
inet addr:172.17.0.2 Bcast:172.17.255.255 Mask:255.255.0.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:46 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:7248 (7.0 KiB) TX bytes:0 (0.0 B)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
Resource Limitation with cgroup
cgroups (control groups) limit the system resources, such as CPU time and memory, that a process can use. Let us run a docker container with a memory limit of 400 MiB (-m 400m).
(base) qfen8290@qfen8290-Inspiron-5555:~$ docker run -it -m 400m busybox /bin/sh
Then we get the full container ID. With it, we can check the memory limit enforced by the cgroup in /sys/fs/cgroup/memory/docker/<container id>/memory.limit_in_bytes. This path applies to hosts using cgroup v1, as here; cgroup v2 lays out the hierarchy differently.
(base) qfen8290@qfen8290-Inspiron-5555:~$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
90562d66d7ef busybox "/bin/sh" 35 seconds ago Up 32 seconds mystifying_spence
(base) qfen8290@qfen8290-Inspiron-5555:~$ docker inspect mystifying_spence | jq '.[].Id'
"90562d66d7efc0b9dddd913488ce47a7ee2fe96bc717473b75eb767d4ace4eae"
(base) qfen8290@qfen8290-Inspiron-5555:~$ cat /sys/fs/cgroup/memory/docker/90562d66d7efc0b9dddd913488ce47a7ee2fe96bc717473b75eb767d4ace4eae/memory.limit_in_bytes
419430400
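The value in the file is the -m 400m argument expressed in bytes, with the suffix read as a binary multiple: 400 × 1024 × 1024 = 419430400. The short Python sketch below reproduces that arithmetic and builds the cgroup v1 path used above. Both helper names are hypothetical, and the parser is a simplification that only handles whole numbers with an optional b, k, m or g suffix.

```python
# Binary multipliers for the size suffixes handled by this sketch.
_UNITS = {"b": 1, "k": 1024, "m": 1024**2, "g": 1024**3}


def docker_memory_to_bytes(value):
    """Convert a size such as '400m' to bytes (whole numbers only)."""
    value = value.strip().lower()
    if value and value[-1] in _UNITS:
        return int(value[:-1]) * _UNITS[value[-1]]
    return int(value)  # no suffix: already a byte count


def cgroup_v1_memory_limit_path(container_id):
    """Path of the limit file, following the cgroup v1 layout shown above."""
    return (
        "/sys/fs/cgroup/memory/docker/"
        f"{container_id}/memory.limit_in_bytes"
    )


print(docker_memory_to_bytes("400m"))  # 419430400
```

Comparing docker_memory_to_bytes("400m") with the integer read from the limit file is a quick way to confirm that the limit requested on the command line is the one the kernel enforces.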
Similarly, we can set limits on CPU time and other resources.
Packaging files and dependencies with rootfs
The container also needs an isolated file system. This can be achieved with pivot_root and chroot, together with a mount namespace, to make a specific directory the root “/” for the container. All the dependent libraries, including the OS file system, can be packaged into the container image. This gives the application a consistent environment wherever it runs.
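The two steps, unpacking a tarball into a directory and then making that directory the root, can be sketched in Python as follows. This is an illustration under stated assumptions, not how Docker does it: the tarball is a tiny demo archive built on the spot, the function names are hypothetical, and enter_rootfs uses plain chroot, which needs root and is not called here.

```python
import io
import os
import tarfile
import tempfile


def build_demo_tarball(tar_path):
    """Create a tiny stand-in for an image tarball (a single file)."""
    data = b"#!/bin/sh\necho hello\n"
    info = tarfile.TarInfo("bin/hello")
    info.size = len(data)
    info.mode = 0o755
    with tarfile.open(tar_path, "w") as tar:
        tar.addfile(info, io.BytesIO(data))


def unpack_rootfs(tar_path, dest):
    """Extract a tarball into `dest`, which becomes the rootfs directory."""
    os.makedirs(dest, exist_ok=True)
    # Use the conservative "data" filter where Python offers it. It rejects
    # entries a real image may contain (device nodes, absolute symlinks).
    extra = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(tar_path) as tar:
        tar.extractall(dest, **extra)
    return dest


def enter_rootfs(rootfs):
    """Make `rootfs` the root directory of this process (needs root)."""
    os.chroot(rootfs)
    os.chdir("/")


with tempfile.TemporaryDirectory() as workdir:
    tar_path = os.path.join(workdir, "image.tar")
    build_demo_tarball(tar_path)
    rootfs = unpack_rootfs(tar_path, os.path.join(workdir, "rootfs"))
    unpacked = sorted(os.listdir(rootfs))
print(unpacked)
```

A container runtime additionally creates a new mount namespace and typically prefers pivot_root, so that the old root can be unmounted instead of merely hidden.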
Docker introduced the concept of layers in its image build design. The rootfs can thus be built incrementally, one layer on top of another, with the help of union file systems such as aufs and OverlayFS (used by Docker's overlay2 storage driver).
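The layering idea can be modelled in a few lines of Python: each layer records only the files it adds, changes or deletes, and the merged view is obtained by applying the layers from bottom to top. This is a conceptual model, not how aufs or OverlayFS are implemented; the dictionaries, paths and the WHITEOUT marker are all hypothetical stand-ins.

```python
# Marker for "this layer deletes the file", loosely modelled on the
# whiteout entries that union file systems use.
WHITEOUT = object()


def merged_view(layers):
    """Combine layers (lowest first) into the file system a container sees."""
    view = {}
    for layer in layers:
        for path, content in layer.items():
            if content is WHITEOUT:
                view.pop(path, None)  # deletion hides the lower file
            else:
                view[path] = content  # upper layers override lower ones
    return view


# Hypothetical layers: a base image, an application layer that removes
# one file, and the writable layer of a running container.
base = {"/bin/sh": "busybox shell", "/etc/motd": "welcome"}
app = {"/app/main.py": "print('hi')", "/etc/motd": WHITEOUT}
container = {"/tmp/scratch": "runtime data"}

view = merged_view([base, app, container])
for path in sorted(view):
    print(path)
```

The lower layers are never modified, which is why one base layer can be shared by many images and containers.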