Container 101

Posted by Jason Feng on January 9, 2021

A container is just another process in Linux. Its file system is extracted from tarballs, it is bound to namespaces, and it is limited by cgroups. Let us take a closer look.

Introduction

Containers, and Docker in particular, have gained popularity in recent years because they accelerate the application development life cycle. Developers can build an application and its dependencies into a Docker image, which can then be easily used for testing, deployment and scaling. Compared to a virtual machine, a container is a lightweight sandbox that provides isolation for the application running inside. Under the hood, it is just an ordinary Linux process that makes use of Linux features such as rootfs, namespaces, and cgroups.

Isolation with Namespaces

We run a busybox Docker container with an interactive shell, then start two sleep commands in the background. Let us look at the processes inside the container with ps.

~$ docker run -it busybox /bin/sh
/ # ps
PID   USER     TIME  COMMAND
    1 root      0:00 /bin/sh
   11 root      0:00 ps
/ # sleep 500 &
/ # sleep 400 &
/ # ps
PID   USER     TIME  COMMAND
    1 root      0:00 /bin/sh
   12 root      0:00 sleep 500
   13 root      0:00 sleep 400
   14 root      0:00 ps

It looks like we have an isolated environment where PID 1 is the shell and the rest are its child processes. Is it a truly separate environment, similar to what a virtual machine provides? What do these processes look like from the host?

First, we obtain the PID of the container's shell on the host (5777 in this instance) with the docker inspect command. Then we check the child processes of PID 5777.

(base) qfen8290@qfen8290-Inspiron-5555:~$ docker ps
CONTAINER ID   IMAGE     COMMAND     CREATED          STATUS          PORTS     NAMES
b1fd7de96a9e   busybox   "/bin/sh"   24 seconds ago   Up 23 seconds             gracious_gagarin

(base) qfen8290@qfen8290-Inspiron-5555:~$ docker inspect gracious_gagarin | jq '.[].State.Pid'
5777
(base) qfen8290@qfen8290-Inspiron-5555:~$ ps -eaf | grep 5777 | grep -v grep
root      5777  5743  0 17:20 pts/0    00:00:00 /bin/sh
root      5845  5777  0 17:20 pts/0    00:00:00 sleep 500
root      5847  5777  0 17:20 pts/0    00:00:00 sleep 400

From the host, we can see that the shell and the two sleep commands we ran in the container are just ordinary processes like any others. Docker simply applies a Linux PID namespace to construct an environment that is camouflaged as an isolated one.

The namespace concept is familiar from programming in Python, where variables defined inside a function are visible only within that function. Similarly, when starting a process in Linux, we can place it in a set of namespaces that determine which system resources it can see. Type man namespaces to get more information. The available namespaces include mount, UTS (hostname), IPC, network, PID, and others.

A namespace wraps a global system resource in an abstraction that makes it appear to the processes within the namespace that they have their own isolated instance of the global resource.
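
To make this more concrete: the kernel exposes the namespaces of every process as symbolic links under /proc/<pid>/ns, and two processes that share a namespace see the same link target there. The following is a minimal sketch of how one could inspect and compare them. The helper names ns_id and same_ns are made up for this illustration and are not part of Docker or any standard tool.

```shell
# Print the identifier of one namespace of a process, e.g. "pid:[<inode>]".
# Usage: ns_id <pid> <type>   (type is one of: mnt, uts, ipc, net, pid, ...)
ns_id() {
    readlink "/proc/$1/ns/$2"
}

# Succeed only if both links can be read and point to the same namespace.
# Usage: same_ns <pid-a> <pid-b> <type>
same_ns() {
    ns_a=$(ns_id "$1" "$3") && ns_b=$(ns_id "$2" "$3") && [ "$ns_a" = "$ns_b" ]
}

# Demo: list the namespaces of the current shell, then compare it with itself.
for ns_type in mnt uts ipc net pid; do
    ns_id "$$" "$ns_type"
done
same_ns "$$" "$$" pid && echo "same PID namespace"
```

For the container above, comparing /proc/5777/ns/pid with /proc/$$/ns/pid as root on the host should show two different identifiers, since Docker places the container in its own PID namespace. Reading the ns entries of another user's process generally requires root.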

With the nsenter command, we can enter the namespaces of the container we just started. And here the magic happens: ps shows the same output as it does inside the container.

(base) qfen8290@qfen8290-Inspiron-5555:~$ sudo nsenter --target 5777  --mount --uts --ipc --net --pid ps
PID   USER     TIME  COMMAND
    1 root      0:00 /bin/sh
   12 root      0:00 sleep 500
   13 root      0:00 sleep 400
   15 root      0:00 ps

Similarly, we can check the hostname (UTS namespace) and the network configuration (network namespace), as shown below.

This is run inside the container.

/ # hostname
b1fd7de96a9e
/ # ifconfig
eth0      Link encap:Ethernet  HWaddr 02:42:AC:11:00:02  
          inet addr:172.17.0.2  Bcast:172.17.255.255  Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:43 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:6617 (6.4 KiB)  TX bytes:0 (0.0 B)

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1 
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

/ # 

These are the corresponding nsenter commands run on the host, with their output.

(base) qfen8290@qfen8290-Inspiron-5555:~$ sudo nsenter --target 5777  --mount --uts --ipc --net --pid hostname
b1fd7de96a9e
(base) qfen8290@qfen8290-Inspiron-5555:~$ sudo nsenter --target 5777  --mount --uts --ipc --net --pid ifconfig
eth0      Link encap:Ethernet  HWaddr 02:42:AC:11:00:02  
          inet addr:172.17.0.2  Bcast:172.17.255.255  Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:46 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:7248 (7.0 KiB)  TX bytes:0 (0.0 B)

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1 
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

Resource Limitation with cgroup

With cgroups, we can limit the system resources, such as CPU time and memory, that a process may use. Let us run a Docker container with a memory limit of 400 MB.

(base) qfen8290@qfen8290-Inspiron-5555:~$ docker run -it -m 400m busybox /bin/sh

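Then we get the Docker container ID. With it, we can check the memory limit, which is enforced by the cgroup, in /sys/fs/cgroup/memory/docker/<container id>/memory.limit_in_bytes. Note that this path belongs to cgroup v1; on hosts using cgroup v2 the limit is exposed in a file named memory.max instead.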

(base) qfen8290@qfen8290-Inspiron-5555:~$ docker ps
CONTAINER ID   IMAGE     COMMAND     CREATED          STATUS          PORTS     NAMES
90562d66d7ef   busybox   "/bin/sh"   35 seconds ago   Up 32 seconds             mystifying_spence
(base) qfen8290@qfen8290-Inspiron-5555:~$ docker inspect mystifying_spence | jq '.[].Id'
"90562d66d7efc0b9dddd913488ce47a7ee2fe96bc717473b75eb767d4ace4eae"
(base) qfen8290@qfen8290-Inspiron-5555:~$ cat /sys/fs/cgroup/memory/docker/90562d66d7efc0b9dddd913488ce47a7ee2fe96bc717473b75eb767d4ace4eae/memory.limit_in_bytes 
419430400
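
The value 419430400 is simply the 400m we asked for expressed in bytes, with m counted as 1024 × 1024. The sketch below redoes that arithmetic. Here mem_to_bytes is a hypothetical helper written for this post, not a Docker command, and it only handles whole numbers with an optional b, k, m or g suffix.

```shell
# Convert a Docker-style memory size (e.g. 400m, 1g, 512k, or plain bytes) to bytes.
# Assumption: input is a whole number with at most one trailing unit letter.
mem_to_bytes() {
    size=$1
    number=${size%[bkmgBKMG]}   # strip a single trailing unit letter, if present
    unit=${size#"$number"}      # the letter that was stripped (empty for plain bytes)
    case $unit in
        ''|b|B) echo "$number" ;;
        k|K)    echo $((number * 1024)) ;;
        m|M)    echo $((number * 1024 * 1024)) ;;
        g|G)    echo $((number * 1024 * 1024 * 1024)) ;;
    esac
}

mem_to_bytes 400m    # prints 419430400, matching memory.limit_in_bytes above
```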

Similarly, we can set limits for CPU time, network, and other resources.
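
For CPU time, one common mechanism is the CFS quota: a cgroup may run for at most a given quota of microseconds in every period, which docker run exposes through the --cpu-quota and --cpu-period options. Below is a small sketch of that arithmetic, assuming the default period of 100000 microseconds. The function cpu_quota_us and its percent argument are hypothetical names chosen for this example.

```shell
# Compute the CFS quota (microseconds of CPU time per period) for a CPU share
# given as a percentage of one core.
# Usage: cpu_quota_us <percent-of-one-core> [period-in-microseconds]
cpu_quota_us() {
    percent=$1
    period_us=${2:-100000}   # assumed default: a 100 ms scheduling period
    echo $((percent * period_us / 100))
}

cpu_quota_us 50     # half a core      -> prints 50000
cpu_quota_us 150    # one and a half   -> prints 150000
```

With these numbers, a command such as docker run -it --cpu-period=100000 --cpu-quota=50000 busybox /bin/sh should cap the container at roughly half a core.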

Packaging Files and Dependencies with rootfs

A container also needs an isolated file system. This can be achieved with the Linux pivot_root and chroot commands, together with a mount namespace, to make a specific directory the root “/” of the container. All dependency libraries, including the OS file system, can be packaged into the container image. This brings the great benefit of a consistent environment for the application.

Docker introduced the concept of layers in its image build design. This is a smart approach that lets the rootfs be built incrementally, with the help of storage drivers such as aufs and overlay2.
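
The core idea behind layering can be sketched as a lookup rule: when the same path exists in several layers, the top-most layer wins. The toy model below imitates only that rule with plain directories. It is not how aufs or overlay2 are implemented (real union file systems also merge directories and handle deletions), and resolve_in_layers is a hypothetical helper name.

```shell
# Resolve a path the way a layered rootfs would: the first (top-most) layer
# that contains the path wins.
# Usage: resolve_in_layers <relative-path> <top-layer-dir> [<lower-layer-dir>...]
resolve_in_layers() {
    path=$1
    shift
    for layer in "$@"; do
        if [ -e "$layer/$path" ]; then
            echo "$layer/$path"
            return 0
        fi
    done
    return 1
}

# Demo with two throwaway directories standing in for image layers.
work=$(mktemp -d)
mkdir -p "$work/base/etc" "$work/app/etc"
echo "from base" > "$work/base/etc/motd"
echo "from base" > "$work/base/etc/hostname"
echo "from app"  > "$work/app/etc/motd"      # shadows the file in the base layer

cat "$(resolve_in_layers etc/motd "$work/app" "$work/base")"       # prints: from app
cat "$(resolve_in_layers etc/hostname "$work/app" "$work/base")"   # prints: from base
rm -rf "$work"
```

Because lower layers are never modified, many images can share the same base layer and only add their own changes on top.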

Reference

Image by Free-Photos from Pixabay