Kernel capabilities turn the binary “root/non-root” dichotomy into a fine-grained access control system. As seen in the user namespace remapping post, the default user within a container is root, and that user is also root on the host machine. However, Docker drops most kernel capabilities for the container’s process, which means that root within a container has fewer privileges than root on the host, although it is still quite powerful.
The default kernel capabilities available to the container root are listed in Docker’s source code: https://github.com/moby/moby/blob/master/oci/caps/defaults.go
This can be checked by running a container and inspecting its capabilities. There are two ways to do that: install libcap in the container and run the capsh --print command, or get the PID of one of the container’s processes and run the getpcaps command on the host machine.
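Besides capsh and getpcaps, the kernel exposes each process's capability sets as hexadecimal bitmasks in /proc/<pid>/status (the CapInh, CapPrm, CapEff and CapBnd lines). The following is a minimal sketch of how such a mask can be decoded by hand. The helper name decode_caps is hypothetical, the bit numbering is taken from linux/capability.h, and the mask in the usage line is only an illustrative sample value, not output captured from a real container.

```python
# Capability names indexed by bit number, following linux/capability.h.
CAP_NAMES = [
    "cap_chown", "cap_dac_override", "cap_dac_read_search", "cap_fowner",
    "cap_fsetid", "cap_kill", "cap_setgid", "cap_setuid", "cap_setpcap",
    "cap_linux_immutable", "cap_net_bind_service", "cap_net_broadcast",
    "cap_net_admin", "cap_net_raw", "cap_ipc_lock", "cap_ipc_owner",
    "cap_sys_module", "cap_sys_rawio", "cap_sys_chroot", "cap_sys_ptrace",
    "cap_sys_pacct", "cap_sys_admin", "cap_sys_boot", "cap_sys_nice",
    "cap_sys_resource", "cap_sys_time", "cap_sys_tty_config", "cap_mknod",
    "cap_lease", "cap_audit_write", "cap_audit_control", "cap_setfcap",
    "cap_mac_override", "cap_mac_admin", "cap_syslog", "cap_wake_alarm",
    "cap_block_suspend", "cap_audit_read", "cap_perfmon", "cap_bpf",
    "cap_checkpoint_restore",
]


def decode_caps(mask_hex):
    """Return the capability names whose bits are set in a hex mask string."""
    mask = int(mask_hex, 16)
    names = []
    bit = 0
    while mask >> bit:  # stop once no higher bits remain
        if mask & (1 << bit):
            # Fall back to a numeric label for bits newer than this table.
            names.append(CAP_NAMES[bit] if bit < len(CAP_NAMES) else f"cap_{bit}")
        bit += 1
    return names


if __name__ == "__main__":
    # Illustrative sample mask (assumption), in the format of a CapEff line.
    for name in decode_caps("00000000a80425fb"):
        print(name)
```

A mask with every bit set corresponds to a process that holds all capabilities known to the running kernel, while a mask of all zeros means none are held.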
The figure below shows the default kernel capabilities allowed in the container's process.
It is worth mentioning that the extremely dangerous --privileged flag, which can be passed when a container is started, enables all the kernel capabilities. This completely breaks isolation: if an attacker compromises a container running with the --privileged flag, they have also compromised the host.
The figure below shows that the container's process running with the --privileged flag has the full set of kernel capabilities available.
Fine-grained access control system
In most cases, containers don't need all the root privileges. They can therefore run with a reduced capability set, meaning that root within a container will have even fewer privileges than the default.
Docker supports the addition and/or removal of kernel capabilities, allowing use of a non-default profile. This may make Docker more secure through capability removal, or less secure through the addition of capabilities. The best practice is to remove all capabilities except those explicitly required for the container to be run.
The parameter --cap-drop=all drops all of the default kernel capabilities. For example, if a web service is to be run, then the following command might be executed:
However, an error will be returned because the capability “cap_net_bind_service” has been removed, and it is required to bind privileged ports (below 1024), in this case port 80 for the web service.
The necessary capability for this requirement can be enabled, whilst dropping all the rest.
In this way the web service will be executed correctly.
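The two invocations described above can be sketched as argument lists built in Python. This is only an illustration under stated assumptions: the helper docker_run_args is hypothetical, and the image name httpd, the detached mode and the 80:80 port mapping are chosen for the example rather than taken from the original commands. The flags --cap-drop and --cap-add are the Docker options discussed in this section.

```python
import shlex


def docker_run_args(image, keep_caps=(), publish=None):
    """Build a `docker run` argument list that drops every capability
    and adds back only the ones listed in keep_caps."""
    args = ["docker", "run", "-d", "--cap-drop=all"]
    for cap in keep_caps:
        args.append(f"--cap-add={cap}")
    if publish:
        args += ["-p", publish]  # host:container port mapping
    args.append(image)
    return args


if __name__ == "__main__":
    # First attempt: everything dropped, so binding port 80 is expected to fail.
    print(shlex.join(docker_run_args("httpd", publish="80:80")))
    # Second attempt: only the capability needed for privileged ports is kept.
    print(shlex.join(docker_run_args("httpd", ["NET_BIND_SERVICE"], "80:80")))
```

Building the command from an explicit allow-list keeps the "drop everything, add back what is needed" policy visible in one place, which makes it easier to review when a service later asks for an extra capability.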
It is now possible to verify that the container is running correctly with only the specified capability.
The image above shows that the httpd process of the container has only the cap_net_bind_service capability as was specified.
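The same check can be automated by comparing the effective capability mask of the process with the single bit that is expected. The sketch below assumes the text of /proc/<pid>/status is already available as a string. The function names are hypothetical, and the sample text is hand-written for the example rather than captured from a real httpd process.

```python
CAP_NET_BIND_SERVICE = 10  # bit number from linux/capability.h


def effective_mask(status_text):
    """Extract the CapEff value from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split(":", 1)[1].strip(), 16)
    raise ValueError("no CapEff line found")


def has_only(status_text, cap_bit):
    """True if the effective set contains exactly the given capability."""
    return effective_mask(status_text) == (1 << cap_bit)


if __name__ == "__main__":
    # Hand-written sample (assumption), shaped like a /proc status excerpt.
    sample = (
        "Name:\thttpd\n"
        "CapPrm:\t0000000000000400\n"
        "CapEff:\t0000000000000400\n"
    )
    print(has_only(sample, CAP_NET_BIND_SERVICE))
```

On a real host, the status text could be read from /proc/<pid>/status using the PID of the container's process, as described earlier for getpcaps.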
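Running each container with only the capabilities it actually needs limits what an attacker can do after compromising it, which significantly improves the security of containerized environments.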
References
https://docs.docker.com/engine/security/security/
Sheila A. Berta
Head of Research at Dreamlab Technologies