Docker considered harmful
In the last yearly update, I talked about isolating my
self-hosted LLMs running in Ollama, as well as
Open WebUI, in systemd-nspawn
containers. However, as I
contemplated writing such a blog post, I realized the inevitable question would
be: why not run it in Docker?
After all, Docker is super popular in self-hosting circles for its “convenience” and “security.” There’s a vast repository of images for almost any software you might want. You can run almost anything with a simple `docker run`, and it’ll run securely in a container. What isn’t there to like?
This is probably going to be one of my most controversial blog posts, but the truth is that over the past decade, I’ve run into so many issues with Docker that I’ve simply had enough of it. I now avoid Docker like the plague. In fact, if some software is only available as a Docker container—or worse, requires Docker compose—I sigh and create a full VM to lock away the madness.
This may seem extreme, but fundamentally it boils down to several things:

- The Docker daemon’s complete overreach;
- Docker’s lack of UID isolation by default;
- Docker’s lack of `init` by default; and
- The quality of Docker images.
Let’s dive into this.
Docker daemon overreach
The Docker daemon basically thinks it owns the whole system. For example, it completely rearranges the firewall rules on a system by default without ever asking, instead of deferring to the system administrator. Every time the Docker daemon starts, it changes the policy of `iptables`’s `FORWARD` chain to `DROP` for no reason. If you don’t like this, you can disable the firewall changes and completely break networking in containers.
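For the record, the knob to stop Docker from touching the firewall lives in the daemon configuration. Here’s a minimal sketch (assuming an otherwise empty `daemon.json`), and as I said, flipping it means you get to recreate all the NAT and forwarding rules yourself:

```sh
# /etc/docker/daemon.json: tell the daemon to keep its hands off iptables.
# Without Docker's rules, containers lose outbound NAT and published ports,
# so you have to supply equivalent firewall rules yourself.
cat >/etc/docker/daemon.json <<'EOF'
{
  "iptables": false
}
EOF
systemctl restart docker
```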
If you want to use your own firewall rules with Docker, you’d better get used to hooking into Docker’s chains, and those hooks stop working if Docker isn’t running. It’s Docker’s way… or the highway. No software should act like it owns the whole system.
This proved super irritating when I had BGP VMs doing routing and Docker just kept mangling the firewall whenever it felt like it. As a result, Docker is permanently banned on any machine doing routing on my network.
Docker’s lack of UID isolation by default
By default, Docker chooses not to use UID namespaces. This means that the user `root` inside the Docker container is the `root` user on the system. If you are running a container that assumes “Docker is security” and runs the application as `root`, and that application gets pwned, the oldest layer of security on Unix-like systems—UID isolation—is already broken. Will seccomp be good enough to protect your system? Do you really want to find out?[1]
This also means that if the container creates a user and runs the application as that user, that UID can easily collide with a user on the real system, making it that much easier to take over that user. Now consider that the default UID for users is `1000`[2]… it’s quite likely for the container to be running as the user you are logged in as right now! Isn’t that comforting?
You can turn on UID namespaces, but the process is super painful: doing so wipes out the entire Docker state, requiring all images and containers to be recreated. There can also only be one UID namespace for all containers running under the same Docker daemon, which isn’t what I’d consider sufficient isolation between containers.
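For reference, here’s roughly what turning it on looks like (a sketch assuming an otherwise empty `daemon.json`); the `default` value makes Docker create a `dockremap` user to remap into:

```sh
# /etc/docker/daemon.json: remap container UIDs/GIDs into an unprivileged range.
# Every container shares this one remapping, and the daemon keeps separate
# state per namespace, so existing images and containers effectively vanish.
cat >/etc/docker/daemon.json <<'EOF'
{
  "userns-remap": "default"
}
EOF
systemctl restart docker
```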
UID namespaces should be enabled by default, but I assume it isn’t because it’ll break many images. This is the curse of the vast repository of images. For Docker to be secure, it needs to run each container in its own UID namespace by default.
Docker’s lack of `init` by default
The other really crazy part is that Docker containers run the `ENTRYPOINT` command as PID 1 by default (i.e. as if it’s `init`). Most Docker containers are not built with an actual `init` process as the `ENTRYPOINT`, so by default, Docker will just run your application as `init`, unless you remember to pass `--init` to `docker run`. This has some funky effects, such as:
- Zombie processes are not reaped. It’s the duty of `init` to reap orphaned zombie processes, and your application doesn’t know how to do that. So every time an orphaned process exits, it lingers around as a zombie until the container stops. Each zombie process consumes system resources and pollutes the `ps` output, and you can easily end up with thousands of these depending on the application. During my time dealing with Docker over the past decade, I’ve seen these zombies floating around quite often.
- PID 1 has special signal handling rules. For this, I’ll instead tell a story from my internship days:
Seven years ago, I was at this company working on an application that could only be tested in Docker for some reason. The application, some C++ extension running inside `python`, had all the problems mentioned before: the `python` process ran as `root` and as PID 1.
Now, since the application was pointlessly containerized, it was very annoying to attach a debugger. So they added a `SIGSEGV` handler—upon segfaulting, it printed the stack trace of every thread before the process was allowed to crash.
One day, I accidentally crashed the application. However, to my surprise, it kept running. You’d assume it had automatically restarted, but weirdly, it wasn’t picking up my changes until I restarted it manually. How strange…
As it turned out, this was caused by special signal handling rules for PID 1. You see, to prevent silly things like `kill -SEGV 1` from killing `init` and then crashing the whole system, the kernel tags PID 1 as `SIGNAL_UNKILLABLE`. This means that most signals will actually be eaten, except if the signal originated from the kernel itself or an ancestor namespace (so you’d still be able to kill it from outside the container). This meant that without a `SIGSEGV` handler, the containerized process would have crashed.
However, the `SIGSEGV` handler consumed the signal to print the stack trace. When it was done, it uninstalled the signal handler and called `kill(getpid(), SIGSEGV)` to trigger what it thought was the default crash behaviour. But due to `SIGNAL_UNKILLABLE`, the kernel simply ate the signal, and the crashed `init` process continued on as if everything was fine.
Needless to say, this is horrifying. When a process crashes, it’s because it’s unsafe to continue. To have it continue on like this… it boggles the mind. This is `On Error Resume Next`[3] levels of horror.
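You can see the signal-eating behaviour for yourself in one line (a sketch; `alpine` is just a conveniently small image):

```sh
# The shell is PID 1 in the container's PID namespace and installs no SIGSEGV
# handler, so the signal it sends itself is discarded instead of killing it,
# and the echo still runs.
docker run --rm alpine sh -c 'kill -SEGV 1; echo "still alive as PID $$"'
```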
I am not sure why Docker didn’t just run every container with `--init` unless the container specifically marked itself as having its own `init` as the `ENTRYPOINT`. Naturally, changing this behaviour now would break a lot of images, which means it’ll probably never happen.
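So in the meantime, you have to remember the flag yourself on every single run (the image name here is just a placeholder):

```sh
# --init injects Docker's bundled minimal init (tini) as PID 1; it reaps
# zombies and forwards signals, and your application runs as an ordinary PID.
docker run --init --rm some-image
```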
The quality of Docker images
Because of Docker’s insane defaults, many images suffer from these footguns one way or another, like running stuff as `root` and PID 1. On top of this, I’ve seen a lot of other horrible practices in Docker images. For example, there was this “production-ready” image for a Flask app that used Flask’s built-in development server (i.e. `flask run`)—which specifically warns against being used in production on startup—instead of something like `uwsgi` or `gunicorn`. When I brought this up to the maintainer, I got a dismissive response that “it works”—as if the development server could scale to handle an actual production load or were hardened against malicious input.
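For context, the difference the maintainer waved away is roughly this (a sketch assuming the usual `app:app` module layout):

```sh
# What the image shipped: Flask's single-threaded development server,
# which warns against production use every time it starts.
flask run --host 0.0.0.0 --port 8000

# What a production image should run instead: a real WSGI server.
pip install gunicorn
gunicorn --workers 4 --bind 0.0.0.0:8000 app:app
```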
In order to deploy a Docker image securely, you’d need to fully understand the way the image was built, at which point you might as well just build your own image, instead of hoping that newer versions of the image will also be built securely.
Speaking of newer versions… because Docker images often contain a full OS, however stripped down, it also means that any time anything in the OS has a security hole, a new image is needed. Does your favourite Docker image update every time there’s an OS-level security patch? It probably doesn’t.
This is not helped by the culture around Docker images: most people only care that the images “work”—as in, perform the expected functions—without giving any consideration to security or even failure cases, as the `flask run` example epitomizes. As a result, a large number of Docker images are fundamentally insecure and should never be deployed in production.
And thus, the vast repository of images that seemed to be Docker’s main advantage is, in truth, a veritable minefield of security holes.
Other issues
Of course, the problems don’t just end there. For example, Docker makes it quite difficult to deploy IPv6 properly in containers, let alone securely, since Docker relies on NAT to avoid exposing all the internal ports to the whole Internet. The only way around this is to… write your own firewall rules, which is especially ironic given the way it treats the firewall.
Due to this inherent difficulty, almost no one enables IPv6 in Docker, which forces containers to keep using IPv4. This undoubtedly hampers global adoption of IPv6 and exacerbates the IPv4 exhaustion problem.
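For what it’s worth, enabling IPv6 on the default bridge looks something like this (a sketch with a made-up ULA prefix), and you’re still on the hook for the ip6tables rules that keep every container port from being exposed:

```sh
# /etc/docker/daemon.json: hand the default bridge an IPv6 subnet.
# Containers get addresses from this prefix directly rather than behind NAT,
# so the firewalling is largely up to you.
cat >/etc/docker/daemon.json <<'EOF'
{
  "ipv6": true,
  "fixed-cidr-v6": "fd00:d0c:e6::/64"
}
EOF
systemctl restart docker
```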
Also, how many times have you run out of disk space when using Docker due to old versions of images floating around?
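The cleanup is at least a one-liner, though you have to remember to actually run it (or schedule it):

```sh
# Remove stopped containers, unused networks, dangling images and build cache.
docker system prune
# More aggressive: also remove every image not used by an existing container.
docker system prune --all
```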
What’s next?
At this point, even if Docker doesn’t screw up my firewall, the only way I’d be able to sleep at night is to deploy my software myself in a way I know is secure. Since a significant number of the problems have to do with Docker images and their compatibility, a lot of the issues here also apply to compatible implementations like Podman or Kubernetes[4].
While it is possible to use Docker and compatibles securely in production, this inevitably requires building your own images that you can trust are done right, and updating them constantly to patch OS-level security holes. In such cases, you have to ask yourself: is it truly worth it or does the image building get in the way?
If you are just going to set up a service once or twice, the full overhead of writing a `Dockerfile` and updating it doesn’t seem worth the trouble. It’s usually only worth it if you need an identical OS image across many machines, such as in the case of the DMOJ judge, for which I reluctantly bite the bullet and build custom images[5] that run on VMs whose only job is to run the DMOJ judge.
If you don’t really need containerization or reproducibility (e.g. the application already runs in its own VM), just deploy it the traditional way.
If you need containerization but not reproducibility, this is where `systemd-nspawn` comes in handy. It’s effectively the same technology as Docker, but it acts more like a full virtual machine. You can build the writable OS image yourself in a secure fashion, it’s a properly installed OS with a full `init`, and you can turn on UID namespaces with a single line in the configuration. Since it’s a full OS, you can easily keep it up-to-date with `unattended-upgrades` or your distro’s equivalent. I’ll talk more about it next time.
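In the meantime, for the curious, that single UID-namespace line lives in the container’s `.nspawn` file. This is a sketch for a hypothetical container named `myct`, assuming it’s managed through `systemd-nspawn@.service`:

```sh
# /etc/systemd/nspawn/myct.nspawn: run the container in its own user namespace,
# picking an unused range of 65536 UIDs/GIDs for it.
cat >/etc/systemd/nspawn/myct.nspawn <<'EOF'
[Exec]
PrivateUsers=pick
EOF
systemctl restart systemd-nspawn@myct
```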
If you need containerization and reproducibility, then it makes sense at last to use Docker or a compatible solution. You should only run trusted images—either your own or ones you absolutely trust are done right—and only then, on (virtual) machines dedicated to running such images.
And finally, if your favourite software forces you to deploy it with Docker[6], then locking it away in a VM will at least protect everything else outside… If you maintain such software, please seriously consider giving users an option to not use Docker.
Notes
1. I have friends who ran Jellyfin in Docker and ended up with a cryptominer installed on the machine outside of the container, so I’ll let that speak for itself.
2. I’ve actually seen containers advertising this as a feature to easily enable access to your files inside the container… Why they’d do this instead of just setting up Unix permissions (or ACLs) properly is beyond me. Sure, it might access the desired files easily, but how do you feel about it accessing your bank account?
3. For those too young to know, `On Error Resume Next` is a statement you can put into a Visual Basic program to make it ignore all errors and just keep running. This obviously created more problems…
4. Kubernetes famously doesn’t have an equivalent to `docker run --init`, so trying to run any Docker image that doesn’t have an `init` is asking for trouble.
5. Building the DMOJ runtimes Docker image takes half an hour, so fixing image issues is super painful and slow to iterate on. The reproducibility and automatic CI building requirements are the only reasons why we even bother.
6. I am looking at you, Immich. And no, the “unofficial” approach of dissecting the `Dockerfile` and installing it manually isn’t a sane approach either when every other release has breaking changes. It really doesn’t help that your Docker compose runs Immich as `root`, and `postgres` and `redis` as PID 1 in their containers…