Docker considered harmful
In the last yearly update, I talked about isolating my
self-hosted LLMs running in Ollama, as well as
Open WebUI, in systemd-nspawn
containers. However, as I
contemplated writing such a blog post, I realized the inevitable question would
be: why not run it in Docker?
After all, Docker is super popular in self-hosting circles for its “convenience” and “security.” There’s a vast repository of images for almost any software you might want. You can run almost anything with a simple `docker run`, and it’ll run securely in a container. What isn’t there to like?
This is probably going to be one of my most controversial blog posts, but the truth is that over the past decade, I’ve run into so many issues with Docker that I’ve simply had enough of it. I now avoid Docker like the plague. In fact, if some software is only available as a Docker container—or worse, requires Docker compose—I sigh and create a full VM to lock away the madness.
This may seem extreme, but fundamentally it boils down to several things:

- The Docker daemon’s complete overreach;
- Docker’s lack of UID isolation by default;
- Docker’s lack of `init` by default; and
- The quality of Docker images.
Let’s dive into this.
Docker daemon overreach
The Docker daemon basically thinks it owns the whole system. For example, it completely rearranges the firewall rules on a system by default without ever asking, instead of deferring to the system administrator. Every time the Docker daemon starts, it changes the policy of `iptables`’s `FORWARD` chain to `DROP` for no reason. If you don’t like this, you can disable the firewall changes and completely break networking in containers.
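For the record, the knob to stop Docker from touching the firewall lives in the daemon configuration. Here’s a minimal sketch (assuming an otherwise empty `daemon.json`), and as I said, flipping it means you get to recreate all the NAT and forwarding rules yourself:

```sh
# /etc/docker/daemon.json: tell the daemon to keep its hands off iptables.
# Without Docker's rules, containers lose outbound NAT and published ports,
# so you have to supply equivalent firewall rules yourself.
cat >/etc/docker/daemon.json <<'EOF'
{
  "iptables": false
}
EOF
systemctl restart docker
```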
If you want to use your own firewall rules with Docker, you’d better get used to hooking into Docker’s chains, and those hooks stop working if Docker isn’t running. It’s Docker’s way… or the highway. No software should act like it owns the whole system.
This proved super irritating when I had BGP VMs doing routing and Docker just kept mangling the firewall whenever it felt like it. As a result, Docker is permanently banned on any machine doing routing on my network.
Docker’s lack of UID isolation by default
By default, Docker chooses not to use UID namespaces. This means that the user `root` inside the Docker container is the `root` user on the system. If you are running a container that assumes “Docker is security” and runs the application as `root`, and that application gets pwned, the oldest layer of security on Unix-like systems—UID isolation—is already broken. Will seccomp be good enough to protect your system? Do you really want to find out?[1]
This also means that if the container creates a user and runs the application as that user, that UID can easily collide with a user on the real system, making it that much easier to take over that user. Now consider that the default UID for users is `1000`[2]… it’s quite likely for the container to be running as the user you are logged in as right now! Isn’t that comforting?
You can turn on UID namespaces, but the process is super painful: doing so wipes out the entire Docker state, requiring all images and containers to be recreated. There can also only be one UID namespace for all containers running under the same Docker daemon, which isn’t what I’d consider sufficient isolation between containers.
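For reference, here’s roughly what turning it on looks like (a sketch assuming an otherwise empty `daemon.json`); the `default` value makes Docker create a `dockremap` user to remap into:

```sh
# /etc/docker/daemon.json: remap container UIDs/GIDs into an unprivileged range.
# Every container shares this one remapping, and the daemon keeps separate
# state per namespace, so existing images and containers effectively vanish.
cat >/etc/docker/daemon.json <<'EOF'
{
  "userns-remap": "default"
}
EOF
systemctl restart docker
```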
UID namespaces should be enabled by default, but I assume it isn’t because it’ll break many images. This is the curse of the vast repository of images. For Docker to be secure, it needs to run each container in its own UID namespace by default.
Docker’s lack of `init` by default
The other really crazy part is that Docker containers run the `ENTRYPOINT` command as PID 1 by default (i.e. as if it’s `init`). Most Docker containers are not built with an actual `init` process as the `ENTRYPOINT`, so by default, Docker will just run your application as `init`, unless you remember to pass `--init` to `docker run`. This has some funky effects, such as:
- Zombie processes are not reaped. It’s the duty of `init` to reap orphaned zombie processes, and your application doesn’t know how to do that. So every time an orphaned process exits, it lingers around as a zombie until the container stops. Each zombie process consumes system resources and pollutes the `ps` output, and you can easily end up with thousands of these depending on the application. During my time dealing with Docker over the past decade, I’ve seen these zombies floating around quite often.
- PID 1 has special signal handling rules. For this, I’ll instead tell a story from my internship days:
Seven years ago, I was at this company working on an application that could only be tested in Docker for some reason. The application, some C++ extension running inside `python`, had all the problems mentioned before: the `python` process ran as `root` and as PID 1.
Now, since the application was pointlessly containerized, it was very annoying to attach a debugger. So they added a `SIGSEGV` handler—upon segfaulting, it printed the stack trace of every thread before the process was allowed to crash.
One day, I accidentally crashed the application. However, to my surprise, it kept running. You’d assume it had automatically restarted, but weirdly, it wasn’t picking up my changes until I restarted it manually. How strange…
As it turned out, this was caused by special signal handling rules for PID 1. You see, to prevent silly things like `kill -SEGV 1` from killing `init` and then crashing the whole system, the kernel tags PID 1 as `SIGNAL_UNKILLABLE`. This means that most signals will actually be eaten, except if the signal originated from the kernel itself or an ancestor namespace (so you’d still be able to kill it from outside the container). This meant that without a `SIGSEGV` handler, the containerized process would have crashed.
However, the `SIGSEGV` handler consumed the signal to print the stack trace. When it was done, it uninstalled the signal handler and called `kill(getpid(), SIGSEGV)` to trigger what it thought was the default crash behaviour. But due to `SIGNAL_UNKILLABLE`, the kernel simply ate the signal, and the crashed `init` process continued on as if everything was fine.
Needless to say, this is horrifying. When a process crashes, it’s because it’s unsafe to continue. To have it continue on like this… it boggles the mind. This is `On Error Resume Next`[3] levels of horror.
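You can see the signal-eating behaviour for yourself in one line (a sketch; `alpine` is just a conveniently small image):

```sh
# The shell is PID 1 in the container's PID namespace and installs no SIGSEGV
# handler, so the signal it sends itself is discarded instead of killing it,
# and the echo still runs.
docker run --rm alpine sh -c 'kill -SEGV 1; echo "still alive as PID $$"'
```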
I am not sure why Docker didn’t just run every container with `--init` unless the container specifically marked itself as having its own `init` as the `ENTRYPOINT`. Naturally, changing this behaviour now would break a lot of images, which means it’ll probably never happen.
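So in the meantime, you have to remember the flag yourself on every single run (the image name here is just a placeholder):

```sh
# --init injects Docker's bundled minimal init (tini) as PID 1; it reaps
# zombies and forwards signals, and your application runs as an ordinary PID.
docker run --init --rm some-image
```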
The quality of Docker images
Because of Docker’s insane defaults, many images suffer from these footguns one way or another, like running stuff as `root` and PID 1. On top of this, I’ve seen a lot of other horrible practices in Docker images. For example, there was this “production-ready” image for a Flask app that used Flask’s built-in development server (i.e. `flask run`)—which specifically warns against being used in production on startup—instead of something like `uwsgi` or `gunicorn`. When I brought this up to the maintainer, I got a dismissive response that “it works”—as if the development server could scale to handle an actual production load or were hardened against malicious input.
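For context, the difference the maintainer waved away is roughly this (a sketch assuming the usual `app:app` module layout):

```sh
# What the image shipped: Flask's single-threaded development server,
# which warns against production use every time it starts.
flask run --host 0.0.0.0 --port 8000

# What a production image should run instead: a real WSGI server.
pip install gunicorn
gunicorn --workers 4 --bind 0.0.0.0:8000 app:app
```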
In order to deploy a Docker image securely, you’d need to fully understand the way the image was built, at which point you might as well just build your own image, instead of hoping that newer versions of the image will also be built securely.
Speaking of newer versions… because Docker images often contain a full OS, however stripped down, it also means that any time anything in the OS has a security hole, a new image is needed. Does your favourite Docker image update every time there’s an OS-level security patch? It probably doesn’t.
This is not helped by the culture around Docker images: most people only care that the images “work”—as in, perform the expected functions—without giving any consideration to security or even failure cases, as the `flask run` example epitomizes. As a result, a large number of Docker images are fundamentally insecure and should never be deployed in production.
And thus, the vast repository of images that seemed to be Docker’s main advantage is, in truth, a veritable minefield of security holes.
Other issues
Of course, the problems don’t just end there. For example, Docker makes it quite difficult to deploy IPv6 properly in containers, let alone securely, since Docker relies on NAT to avoid exposing all the internal ports to the whole Internet. The only way around this is to… write your own firewall rules, which is especially ironic given the way it treats the firewall.
Due to this inherent difficulty, almost no one enables IPv6 in Docker, which forces containers to keep using IPv4. This undoubtedly hampers global adoption of IPv6 and exacerbates the IPv4 exhaustion problem.
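For what it’s worth, enabling IPv6 on the default bridge looks something like this (a sketch with a made-up ULA prefix), and you’re still on the hook for the ip6tables rules that keep every container port from being exposed:

```sh
# /etc/docker/daemon.json: hand the default bridge an IPv6 subnet.
# Containers get addresses from this prefix directly rather than behind NAT,
# so the firewalling is largely up to you.
cat >/etc/docker/daemon.json <<'EOF'
{
  "ipv6": true,
  "fixed-cidr-v6": "fd00:d0c:e6::/64"
}
EOF
systemctl restart docker
```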
Also, how many times have you run out of disk space when using Docker due to old versions of images floating around?
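The cleanup is at least a one-liner, though you have to remember to actually run it (or schedule it):

```sh
# Remove stopped containers, unused networks, dangling images and build cache.
docker system prune
# More aggressive: also remove every image not used by an existing container.
docker system prune --all
```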
What’s next?
At this point, even if Docker doesn’t screw up my firewall, the only way I’d be able to sleep at night is to deploy my software myself in a way I know is secure. Since a significant number of the problems have to do with Docker images and their compatibility, a lot of the issues here also apply to compatible implementations like Podman or Kubernetes[4].
While it is possible to use Docker and compatibles securely in production, this inevitably requires building your own images that you can trust are done right, and updating them constantly to patch OS-level security holes. In such cases, you have to ask yourself: is it truly worth it or does the image building get in the way?
If you are just going to set up a service once or twice, the full overhead of writing a `Dockerfile` and updating it doesn’t seem worth the trouble. It’s usually only worth it if you need an identical OS image across many machines, such as in the case of the DMOJ judge, for which I reluctantly bite the bullet and build custom images[5] that run on VMs whose only job is to run the DMOJ judge.
If you don’t really need containerization or reproducibility (e.g. the application already runs in its own VM), just deploy it the traditional way.
If you need containerization but not reproducibility, this is where `systemd-nspawn` comes in handy. It’s effectively the same technology as Docker, but it acts more like a full virtual machine. You can build the writable OS image yourself in a secure fashion, it’s a properly installed OS with a full `init`, and you can turn on UID namespaces with a single line in the configuration. Since it’s a full OS, you can easily keep it up-to-date with `unattended-upgrades` or your distro’s equivalent. I’ll talk more about it next time.
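In the meantime, for the curious, that single UID-namespace line lives in the container’s `.nspawn` file. This is a sketch for a hypothetical container named `myct`, assuming it’s managed through `systemd-nspawn@.service`:

```sh
# /etc/systemd/nspawn/myct.nspawn: run the container in its own user namespace,
# picking an unused range of 65536 UIDs/GIDs for it.
cat >/etc/systemd/nspawn/myct.nspawn <<'EOF'
[Exec]
PrivateUsers=pick
EOF
systemctl restart systemd-nspawn@myct
```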
If you need containerization and reproducibility, then it makes sense at last to use Docker or a compatible solution. You should only run trusted images—either your own or ones you absolutely trust are done right—and only then, on (virtual) machines dedicated to running such images.
And finally, if your favourite software forces you to deploy it with Docker[6], then locking it away in a VM will at least protect everything else outside… If you maintain such software, please seriously consider giving users an option to not use Docker.
Notes
1. I have friends who ran Jellyfin in Docker and ended up with a cryptominer installed on the machine outside of the container, so I’ll let that speak for itself.
2. I’ve actually seen containers advertising this as a feature to easily enable access to your files inside the container… Why they’d do this instead of just setting up Unix permissions (or ACLs) properly is beyond me. Sure, it might access the desired files easily, but how do you feel about it accessing your bank account?
3. For those too young to know, `On Error Resume Next` is a statement you can put into a Visual Basic program to make it ignore all errors and just keep running. This obviously created more problems…
4. Kubernetes famously doesn’t have an equivalent to `docker run --init`, so trying to run any Docker image that doesn’t have an `init` is asking for trouble.
5. Building the DMOJ runtimes Docker image takes half an hour, so fixing image issues is super painful and slow to iterate on. The reproducibility and automatic CI building requirements are the only reasons why we even bother.
6. I am looking at you, Immich. And no, the “unofficial” approach of dissecting the `Dockerfile` and installing it manually isn’t a sane approach either when every other release has breaking changes. It really doesn’t help that your Docker compose runs Immich as `root`, and `postgres` and `redis` as PID 1 in their containers…