Docker from scratch

(part of the series From scratch)

Guy Waldman

June 1, 2026 (2 months ago)

One of my absolute favorite things in software is picking something up and building it by hand from the ground up.
I love implementing tools, protocols, etc. from scratch to get a mental model of how they really work, and you learn a lot in the process.

This is the first post in a series where we build things from scratch (well, "from scratch" within reason), and I'm starting off with something that I personally always find very exciting - containers! Next up in this series - building our own AI "agent" from scratch (stay tuned).

When containers started to become a thing, it felt like black magic to me.
How can you spin up Postgres consistently across devices, and have it start up lightning fast?
How amazing is it that we can run 20 freaking Linux containers on our Mac/Windows laptops in a second?

Anyway, we all know and use Docker daily as software developers.
And we also use containers a lot when we develop locally and in our clouds.
But did you ever ask yourself how Docker really works and what containers actually consist of?

We will develop something resembling Docker from scratch.
As you will see later on, it is not accurate to say that we are totally building Docker, but rather a toy container engine/runtime.

In either case, this hands-on walkthrough will help us build a mental model of what really happens under the hood, and that is my most important message here.

Short note before we begin - there cannot be enough credit given to iximiuz. I've been reading their technical posts for a few years now and even pay for the paid series, and it is simply a gold mine.

What are containers anyway?

Containers are a means of running software while meeting several requirements.
Let's break down some of them:

Portable (run consistently across environments/devices/hardware)
Reproducible (same image will result in the same behavior)
Lightweight (fast startup, low overhead, etc.)
Isolated (apps should not interfere with each other or the host)
Secure-ish (not a hard security boundary, but should have limited permissions/access and reduce blast radius of a compromised process)

What possible infamous open source project can be used to make all this happen?

Hello, Linux!

Technically, there are also Windows containers, and newer macOS container tooling exists too.
Those systems have different implementation details, but the high-level ideas should be easier to understand after understanding the Linux model.

We'll focus on Linux for the purpose of this post, and leave Windows/macOS-specific details out.

So, the Linux kernel provides the basic primitives to make containers happen, and can help us create a special process that makes a container, well, a container.

So the first thing to understand is crucial, and is the basis for the rest of this post:

Containers are Linux processes

This was the real shocker for me when I first understood this - it's all about processes! Every container is, in its own very basic form, a process with sprinkles on top. Blows my mind every time I think about it.

So, Linux! Linux provides primitives we can use to make a process truly isolated.

The most important ones to understand are namespaces, cgroups, filesystem isolation, the networking stack and security primitives.
There may be more features of the kernel that are at play here, but those are the ones that are usually talked about (and honestly, the only ones I really cared to understand).
We'll work step by step, and eventually explain how these are used to make the container we know and love.
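
To make "a process with sprinkles on top" a bit more concrete: every Linux process already belongs to a set of namespaces and a cgroup, and the kernel exposes both under /proc. Here is a minimal sketch that prints them for the current shell. The function name list_namespaces is my own, and the script assumes a Linux host with /proc mounted (it prints a short notice otherwise).

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print the namespaces a process belongs to.
# Each entry under /proc/<pid>/ns is a symlink such as "pid:[4026531836]";
# two processes in the same namespace see the same inode number.
list_namespaces() {
	local pid="$1" ns
	if [[ ! -d "/proc/$pid/ns" ]]; then
		echo "no /proc/$pid/ns here (not Linux, or /proc is not mounted)"
		return 0
	fi
	for ns in "/proc/$pid/ns"/*; do
		printf '%s -> %s\n' "${ns##*/}" "$(readlink "$ns" || echo '?')"
	done
}

list_namespaces "$$"

# The cgroup membership of the same process (if available)
if [[ -r "/proc/$$/cgroup" ]]; then
	cat "/proc/$$/cgroup"
fi
```

Running the same function from inside a container should print different inode numbers for every namespace the container does not share with the host.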

The rest are higher level concepts which are much easier to understand (such as OCI and OCI registries, which we will break down later).

But before we go into all that, let's create a simple (and bad) container, and work our way upwards into isolating it and securing it.

Some ground rules first:

We can't use docker or containerd (more on that soon) for the actual implementation, since that would defeat the entire purpose of this educational exercise
Simplify aggressively - the actual API surface of these tools is much greater than what we cover in this educational exercise

Setting up the lab

So, to run a container we need Linux. For simplicity, we could use a Linux container which can illustrate some of the concepts.

However, for a closer experience to what Docker really does, let's set up a real live Linux VM.

My daily driver is an ARM-based MacBook, so this should work on M1+ macOS. However, the concepts should translate easily to other architectures or devices.

Docker architecture

This is a basic mental model to understand how running docker CLI commands works:

[Diagram: docker CLI → dockerd → containerd → containerd-shim → container runtime (runc/crun) → container process]

The docker CLI is the docker build and docker run you run every day; there's really nothing special about it. It speaks to a daemon process called dockerd, which (with Docker Desktop) runs inside a Linux VM.

dockerd, in turn, speaks to containerd which is the core container platform daemon - it handles the container lifecycle, and handles pushing/pulling/storing/unpacking images, running/stopping/monitoring containers (delegates this to the container runtime) and more.
When a container starts, containerd launches a dedicated process for it called containerd-shim. Its role, as I understand it, is to basically be the caretaker of a single process, and mostly decouple the responsibility of managing that container from containerd (also important in case containerd dies and the container needs to keep running).

The way I think about it, containerd is like a surrogate that takes care of the container up until it's able to run, then containerd-shim takes over.

After all this, there is the container runtime (popular ones are runc/crun), which is what actually makes the container run.
It's often called an OCI runtime, and we will understand why a bit later down the post.
It does a lot of setup to isolate and safeguard the process which we will implement ourselves later on.

So for Docker Desktop on macOS, the Docker Linux VM is where all the magic happens, and these daemons all run inside it.
Docker CLI basically talks to dockerd that is running in the Docker VM (technically a socket exposed on the host which proxies to the VM, but you get the point).

An interesting thing to note is that when you spin up a container from an image that's already pulled, it happens almost instantly. Logically, this means that the image artifacts themselves are stored inside the VM, and this is indeed what happens in practice: the VM holds the "image cache", and we'll soon understand what this really means.

If you don't believe me that there is an actual VM running, on macOS you can actually see the Docker VM by running:

ls ~/Library/Containers/com.docker.docker/Data/vms

And you will even see all of your container IDs here:

docker run -it --rm --privileged --pid=host justincormack/nsenter1
$ ls /var/lib/docker/containers

Takeaways

The magic of containers is all about Linux processes

In the Docker Desktop on macOS example, there is a Linux VM used to run the containers

Getting a container to run involves a few daemons along the way, each handling a different layer of responsibility (images/registries are higher, containers are lower)

The container runtime does all the grunt work which we will soon try to implement ourselves

Spinning up our own Linux VM

QEMU (which I've actually just learned stands for Quick Emulator!) is a very popular and reliable hypervisor and emulator. I used it in the past to implement a toy OS, and this was my immediate go-to for providing the Linux host we will want to run containers on.

From some background research, Docker Desktop on macOS actually used QEMU up until a while ago, but that backend is now considered deprecated. On ARM, either AVF (Apple Virtualization Framework) or a special Docker VMM (Virtual Machine Manager) is used.
See the docs for Docker VMM and Docker Desktop networking.

In either case, a very easy way to spin up a VM on macOS is AVF with lima.

Lima is awesome! Let's try it out.

limactl start template://ubuntu

That's it! Behind the scenes, it downloads the Ubuntu image (e.g., https://cloud-images.ubuntu.com/releases/questing/release-20260320/ubuntu-25.10-server-cloudimg-arm64.img).
This is an interesting point, actually - this is not your standard Ubuntu Server installer ISO, but rather a cloud image. It is a pre-installed Ubuntu that is optimized for VMs and cloud boot, and it is essentially what cloud providers use under the hood for managed VM services.

A super interesting question to ask is: how can x86 containers run on an ARM-based MacBook?

Well, a lot of it is down to Rosetta, Apple's translation layer that allows x86 programs to run on ARM Macs (originally built for the x86->ARM transition).
It basically translates x86_64 instructions to ARM64 on the fly, often with only a minor performance hit, which is pretty incredible.

Rosetta is what allows us to run an x86 container on an ARM device on macOS.
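
A small practical note on naming: the kernel and the container ecosystem spell architectures differently. uname -m prints names like aarch64 or x86_64, while image configs and manifests use arm64 and amd64 (we'll see "architecture": "arm64" in an image config later on). Here is a minimal sketch of that mapping; the helper name oci_arch is hypothetical, and it only covers the two architectures discussed in this post.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Map the kernel's machine name (as printed by `uname -m`) to the
# architecture name used in image configs and manifests.
# Only the two architectures discussed in this post are covered.
oci_arch() {
	case "$1" in
		x86_64) echo "amd64" ;;
		aarch64 | arm64) echo "arm64" ;;
		*) echo "unknown" ;;
	esac
}

echo "machine: $(uname -m) -> image architecture: $(oci_arch "$(uname -m)")"
```

Comparing this value with the architecture recorded in an image is a quick way to tell whether a translation layer would be needed to run it.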

`toydocker` boilerplate

Let's create a toydocker Shell script that will serve as the basis for some commands we want to implement (in a very basic form and not in compliance with the actual docker CLI commands):

toydocker.sh
#!/usr/bin/env bash

set -euo pipefail

function build {
	local name="${1:-}"
	if [[ -z "$name" ]]; then
		echo "Usage: $0 build <image_name>" >&2
		exit 1
	fi

	echo "TODO: Implement 'toydocker build'" >&2
	exit 1
}

function run {
	local name="${1:-}"
	if [[ -z "$name" ]]; then
		echo "Usage: $0 run <image_name>" >&2
		exit 1
	fi

	echo "TODO: Implement 'toydocker run'" >&2
	exit 1
}

function pull {
	local name="${1:-}"
	if [[ -z "$name" ]]; then
		echo "Usage: $0 pull <image_name>" >&2
		exit 1
	fi

	echo "TODO: Implement 'toydocker pull'" >&2
	exit 1
}

function main {
	local cmd="${1:-}"
	shift || true

	case "$cmd" in
		build)
			build "$@"
			;;
		run)
			run "$@"
			;;
		pull)
			pull "$@"
			;;
		*)
			echo "Usage: $0 <build|run|pull>"
			exit 1
			;;
	esac
}

main "$@"

Implementing `toydocker build`

Before implementing toydocker build, we'll need to understand how images are represented. To do this, let's reverse engineer one.

What actually is an image?

Everything around containers, images, registries and the ecosystem around them revolves around OCI (Open Container Initiative). OCI is a project under the Linux Foundation umbrella, with the goal of creating vendor-neutral, open industry standards for container formats and runtimes.

OCI defines standards such as formats and schemas for how OCI artifacts (such as images) are represented in the file system.

Images, as you probably know, are the "recipe" for containers.
This is wildly inaccurate, but personally I like to think of it as a static, paused state of a container, from which you can instantiate infinite containers off the same image and essentially "continue" their run.
More concretely, an image is a filesystem blueprint plus metadata: the layers that make up the root filesystem, and configuration such as environment variables, working directory, exposed ports, entrypoint and default command.

One common way to store an image locally is an OCI image archive, which is a tar archive whose contents follow the OCI image spec.

For example, unpacking an OCI image archive will usually yield:

oci-layout - identifies this directory/archive as an OCI image layout
index.json - a high-level index that points to one or more image manifests (for example, separate manifests for different CPU architectures)
blobs/ - content-addressed blobs, keyed by digest; this is where the image manifest, image config and layer tarballs live.

Note that manifest and config are usually not top-level files named manifest.json and config.json in an OCI archive. Instead, index.json points to a manifest blob, the manifest points to the config blob and layer blobs, and the blobs are stored under paths like blobs/sha256/<digest>.
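
"Content-addressed" simply means that a blob's file name is the hash of its content. The following is a minimal sketch of that idea, not how any real tool implements it: put_blob and verify_blob are hypothetical helpers, the layout directory is a throwaway temp dir, and sha256sum from GNU coreutils is assumed to be available.

```shell
#!/usr/bin/env bash
set -euo pipefail

layout="$(mktemp -d)"
mkdir -p "$layout/blobs/sha256"

# Store a file as a blob named after the SHA-256 of its content and
# print the resulting digest ("sha256:<hex>").
put_blob() {
	local file="$1" hex
	hex="$(sha256sum "$file" | cut -d' ' -f1)"
	cp "$file" "$layout/blobs/sha256/$hex"
	echo "sha256:$hex"
}

# Re-hash a stored blob and check that it still matches its name.
verify_blob() {
	local hex="${1#sha256:}" actual
	actual="$(sha256sum "$layout/blobs/sha256/$hex" | cut -d' ' -f1)"
	[[ "$actual" == "$hex" ]]
}

printf '{"hello":"world"}' > "$layout/example.json"
digest="$(put_blob "$layout/example.json")"
echo "stored as blobs/${digest/://}"
verify_blob "$digest" && echo "digest verified"
```

A nice side effect of this scheme is that identical content is stored only once, and any corruption or tampering is detectable by re-hashing the blob.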

There is a related but separate concept called an OCI runtime bundle.
An image archive is a storage/distribution format. A runtime bundle is what low-level runtimes like runc/crun consume: a prepared root filesystem plus a runtime config.json.

In real container stacks, something above runc unpacks/resolves the image, prepares the root filesystem and runtime config, and then asks the OCI runtime to create the process.
runc is more of the engine that takes over after the setup.

Now let's build an OCI archive using buildah.

Let's install it on the VM: limactl shell ubuntu -- sudo apt-get install -y buildah

Let's create a basic Dockerfile on the host:

FROM alpine:3.20
RUN echo "Hello!" > /hello.txt
CMD ["cat", "/hello.txt"]

And then try to use buildah to build a proper OCI-compliant image.

Let's make some changes to toydocker:

toydocker
function build {
	local name="${1:-}"
	if [[ -z "$name" ]]; then
		echo "Usage: $0 build <image_name>" >&2
		exit 1
	fi

	# Copy the Dockerfile to the Lima VM
	limactl copy Dockerfile ubuntu:/tmp/Dockerfile
	# Use buildah to build the image inside the Lima VM
	limactl shell ubuntu -- sudo buildah bud --storage-driver vfs -t "$name" -f /tmp/Dockerfile .
}

The --storage-driver vfs flag selects the driver that determines how the container/image layers are stored and combined into a usable filesystem. We'll go over this in detail when we create a container more in line with how container runtimes actually do it; for now, simply trust the process.

Let's try building:

$ ./toydocker build basicimg
STEP 1/3: FROM alpine:3.20
Resolved "alpine" as an alias (/etc/containers/registries.conf.d/shortnames.conf)
Trying to pull docker.io/library/alpine:3.20...
Getting image source signatures
Copying blob 3f26bc2dec0b done   | 
Copying config ab3fe4defd done   | 
Writing manifest to image destination
STEP 2/3: RUN echo "Hello!" > /hello.txt
STEP 3/3: CMD ["cat", "/hello.txt"]
COMMIT basicimg
Getting image source signatures
Copying blob 88b4fba61c4c skipped: already exists  
Copying blob 9ffcab98c489 done   | 
Copying config 466c9fe1e1 done   | 
Writing manifest to image destination
--> 466c9fe1e124
Successfully tagged localhost/basicimg:latest
466c9fe1e124843d36fafcc1494438182f74dbe563bd3ac458d0c6377e72b342

At this point, we are actually cheating a bit, since Lima comes preinstalled with a few container tools that do a lot of heavy lifting.

To illustrate this, let's make two tools called crun and runc not immediately accessible from $PATH:

sudo mv /usr/local/bin/runc /usr/local/bin/runc.bak
sudo mv /usr/bin/crun /usr/bin/crun.bak

And also prune the cached image layers (we'll talk about them soon) that buildah creates:

buildah prune --all

Building again will error out:

$ ./toydocker build basicimg
STEP 1/3: FROM alpine:3.20
STEP 2/3: RUN echo "Hello!" > /hello.txt
error running container: from  creating container for [/bin/sh -c echo "Hello!" > /hello.txt]: : exec: no command
ERRO[0000] did not get container create message from subprocess: EOF
Error: building at STEP "RUN echo "Hello!" > /hello.txt": while running runtime: exit status 1

The most important part is "did not get container create message from subprocess".

At this point, you may ask yourself how the image is built when there are commands such as RUN which actually require a container themselves; these are temporary build containers.

So building an image requires a container. To demonstrate this, the build will work again if we explicitly set the container runtime to one of the binaries we just renamed:

toydocker
# ...
function build {
	local name="${1:-}"
	if [[ -z "$name" ]]; then
		echo "Usage: $0 build <image_name>" >&2
		exit 1
	fi

	# Copy the Dockerfile to the Lima VM
	limactl copy Dockerfile ubuntu:/tmp/Dockerfile
	# Use buildah to build the image inside the Lima VM
	limactl shell ubuntu -- sudo buildah bud --storage-driver vfs \
		--runtime /usr/local/bin/runc.bak \
		-t "$name" -f /tmp/Dockerfile .
}

# ...

Now ./toydocker build basicimg will work as before. We can also rename them back to get back to the old state and remove the --runtime flag:

sudo mv /usr/bin/crun.bak /usr/bin/crun
sudo mv /usr/local/bin/runc.bak /usr/local/bin/runc

For the next step, let's explore the contents of the image.
Let's export the proper OCI-compliant image archive, which buildah can also do.

Let's add a step to toydocker build that exports the archive with buildah push. Note that it needs the same --storage-driver vfs flag as buildah bud, so that it looks for the image in the same store:

toydocker
# ...

STORAGE_PATH="/var/lib/toydocker"

function build {
	local name="${1:-}"
	if [[ -z "$name" ]]; then
		echo "Usage: $0 build <image_name>" >&2
		exit 1
	fi

	limactl shell ubuntu -- sudo mkdir -p "$STORAGE_PATH"

	# Copy the Dockerfile to the Lima VM
	limactl copy Dockerfile ubuntu:/tmp/Dockerfile
	# Use buildah to build the image inside the Lima VM
	limactl shell ubuntu -- sudo buildah bud --storage-driver vfs -t "$name" -f /tmp/Dockerfile .
	# Save the image as an OCI archive
	limactl shell ubuntu -- sudo buildah push --storage-driver vfs "$name" "oci-archive:$STORAGE_PATH/$name.tar"
}

# ...

Sidenote: We would also want to add limactl shell ubuntu -- sudo mkdir -p "$STORAGE_PATH" to pull, such that it doesn't fail if a build didn't run earlier.

Now let's run ./toydocker build basicimg and shell into the VM to explore the image contents:

$ ./toydocker build basicimg
...

$ limactl shell ubuntu
$ sudo su # Change to root to operate on `/var/lib`

$ cd /var/lib/toydocker
$ ls
basicimg.tar

$ mkdir img
$ tar xf basicimg.tar -C img
$ cd img
$ ls
blobs  index.json  oci-layout

Cool, the index.json we already know from the OCI overview from before.
Let's explore it by running jq <index.json

{
  "schemaVersion": 2,
  "manifests": [
    {
      "mediaType": "application/vnd.oci.image.manifest.v1+json",
      "digest": "sha256:440d82fb2b7b1716661d2b2e90990c90001cfec0c2413eaf0a7af08f32bd1b43",
      "size": 1192
    }
  ]
}

Now let's see the manifest by running jq <blobs/sha256/440d82*

{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.manifest.v1+json",
  "config": {
    "mediaType": "application/vnd.oci.image.config.v1+json",
    "digest": "sha256:3c9867a2f51b2fd434ce7bbd79785a3b7978d3a4cbbfdb5a603012ccd55231bd",
    "size": 1031
  },
  "layers": [
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:b9d22f021e433758234d99c4053bf0d04d4b8fed9ec35118338088a3bb12f330",
      "size": 4214323
    },
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:4722eacc02c1fdfc11bb8f6fab55abd771c8a8e215f86c4b889cc5566e3753f2",
      "size": 193
    }
  ],
  "annotations": {
    "com.docker.official-images.bashbrew.arch": "arm64v8",
    "org.opencontainers.image.base.digest": "sha256:45e09956dc667c5eff3583c9d94830261fb1ca0be10a0a7db36266edf5de9e1d",
    "org.opencontainers.image.base.name": "docker.io/library/alpine:3.20",
    "org.opencontainers.image.created": "2026-04-16T23:53:21Z",
    "org.opencontainers.image.revision": "0db70ae354ee747109ce0b9a0cfbcd3c907bc822",
    "org.opencontainers.image.source": "https://github.com/alpinelinux/docker-alpine.git#0db70ae354ee747109ce0b9a0cfbcd3c907bc822:aarch64",
    "org.opencontainers.image.url": "https://hub.docker.com/_/alpine",
    "org.opencontainers.image.version": "3.20.10"
  }
}

Layers

We can see the layers, which will make up the container root filesystem.

Let's look at the first one:

$ mkdir /tmp/layer1
$ tar xf blobs/sha256/b9d22f021e* -C /tmp/layer1
$ ls /tmp/layer1
bin  etc   lib    mnt  proc  run   srv  tmp  var
dev  home  media  opt  root  sbin  sys  usr

This looks like the root FS of the alpine image we based our image on (remember the FROM alpine:3.20 in the Dockerfile).

We can see this clearly by running:

cat /tmp/layer1/etc/os-release
# ->
# NAME="Alpine Linux"
# ID=alpine
# VERSION_ID=3.20.10
# PRETTY_NAME="Alpine Linux v3.20"
# HOME_URL="https://alpinelinux.org/"
# BUG_REPORT_URL="https://gitlab.alpinelinux.org/alpine/aports/-/issues"

Now for the second layer:

$ mkdir /tmp/layer2
$ tar xf blobs/sha256/4722eacc* -C /tmp/layer2
$ ls /tmp/layer2
etc  hello.txt  run

This is the layer we created ourselves on top of the Alpine layer, as evident by the hello.txt.
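
The order in which layers are applied matters: when two layers contain the same path, the later (upper) one wins. Here is a tiny self-contained sketch of that behavior, using two made-up layers in a temp directory rather than the real image blobs.

```shell
#!/usr/bin/env bash
set -euo pipefail

work="$(mktemp -d)"
mkdir -p "$work/lower" "$work/upper" "$work/rootfs"

# Lower layer: two files
echo "from lower" > "$work/lower/hello.txt"
echo "only in lower" > "$work/lower/base.txt"
# Upper layer: replaces hello.txt
echo "from upper" > "$work/upper/hello.txt"

tar -cf "$work/lower.tar" -C "$work/lower" .
tar -cf "$work/upper.tar" -C "$work/upper" .

# Apply the layers bottom to top, the same order as in the manifest
for layer in "$work/lower.tar" "$work/upper.tar"; do
	tar -xf "$layer" -C "$work/rootfs"
done

cat "$work/rootfs/hello.txt" # prints "from upper"
cat "$work/rootfs/base.txt"  # prints "only in lower"
```

Files that only exist in a lower layer stay visible, which is why our second layer only needed to carry hello.txt and a couple of touched directories.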

Config

The manifest also references the config blob (its format is defined by the OCI image spec), so let's see what it holds by running jq <blobs/sha256/3c9867a2f51b2fd434ce7bbd79785a3b7978d3a4cbbfdb5a603012ccd55231bd:

{
  "created": "2026-05-22T13:17:23.135992179Z",
  "architecture": "arm64",
  "os": "linux",
  "variant": "v8",
  "config": {
    "Env": [
      "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
    ],
    "Cmd": [
      "cat",
      "/hello.txt"
    ],
    "WorkingDir": "/",
    "Labels": {
      "io.buildah.version": "1.39.3"
    }
  },
  "rootfs": {
    "type": "layers",
    "diff_ids": [
      "sha256:88b4fba61c4c714a2fc173ddf7e9324a257304696830bf41ad28ecc58c11c95f",
      "sha256:6b6835404fffb40fecff949bd056a5cfc32bd3a3f959712863d219bdf041ca10"
    ]
  },
  "history": [
    {
      "created": "2026-04-16T23:53:24.896953537Z",
      "created_by": "ADD alpine-minirootfs-3.20.10-aarch64.tar.gz / # buildkit",
      "comment": "buildkit.dockerfile.v0"
    },
    {
      "created": "2026-04-16T23:53:24.896953537Z",
      "created_by": "CMD [\"/bin/sh\"]",
      "comment": "buildkit.dockerfile.v0",
      "empty_layer": true
    },
    {
      "created": "2026-05-22T13:17:23.129222578Z",
      "created_by": "/bin/sh -c echo \"Hello!\" > /hello.txt",
      "comment": "FROM docker.io/library/alpine:3.20",
      "empty_layer": true
    },
    {
      "created": "2026-05-22T13:17:24.001882004Z",
      "created_by": "/bin/sh -c #(nop) CMD [\"cat\", \"/hello.txt\"]"
    }
  ]
}

Notably, we can notice:

The config field includes runtime defaults (can be overridden) such as command/entrypoint, working directory and environment variables
The rootfs.diff_ids field references the uncompressed layer digests in order - the manifest's layer descriptors reference the compressed layer blobs
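
The difference between the two kinds of digests is easy to reproduce by hand. The sketch below builds a throwaway one-file layer (not the real image), compresses it, and hashes both forms; it assumes tar, gzip and sha256sum are available.

```shell
#!/usr/bin/env bash
set -euo pipefail

work="$(mktemp -d)"
mkdir -p "$work/layer"
echo "Hello!" > "$work/layer/hello.txt"

# An uncompressed layer tarball, and the gzip-compressed blob that gets stored
tar -cf "$work/layer.tar" -C "$work/layer" .
gzip -c "$work/layer.tar" > "$work/layer.tar.gz"

# What the manifest's layer descriptor records: the digest of the compressed blob
blob_digest="sha256:$(sha256sum "$work/layer.tar.gz" | cut -d' ' -f1)"
# What rootfs.diff_ids records: the digest of the uncompressed tar stream
diff_id="sha256:$(gzip -dc "$work/layer.tar.gz" | sha256sum | cut -d' ' -f1)"

echo "manifest layer digest: $blob_digest"
echo "config diff_id:        $diff_id"
```

This is why the digests under layers in the manifest and the ones under rootfs.diff_ids in the config never match for gzip-compressed layers, even though they describe the same content.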

Takeaways

Container images are filesystem and metadata blueprints. Higher-level tools turn them into a root filesystem and runtime config that low-level OCI runtimes like runc can execute

Container images are encoded according to OCI specifications, and container tooling like containerd knows how to store and work with them

Image layers are more or less derived from the base image chain and the commands we run in our Dockerfiles

The container images themselves are tarballs which contain metadata (index, manifests, configs), and blobs addressable by their digest (SHA256 hash) which contain layers and sometimes some of the metadata files

Images are built in layers, which are essentially stacked on top of each other (more on that soon, see the "OverlayFS" section)

Implementing `toydocker pull`

We could also implement the interaction with an OCI registry (such as Docker Hub) from scratch using its REST API, but that would take much longer and is not the point of this walkthrough.
Let's use skopeo, a tool designed to talk to OCI registries, in order to pull an image from Docker Hub.

First let's install skopeo by running limactl shell ubuntu -- sudo apt-get install -y skopeo.

Now let's add the implementation of pull:

# ...

function pull {
	local name="${1:-}"
	if [[ -z "$name" ]]; then
		echo "Usage: $0 pull <image_name>" >&2
		exit 1
	fi

	# Save the OCI archive using skopeo
	limactl shell ubuntu -- sudo skopeo copy "docker://$name" "oci-archive:$STORAGE_PATH/$name.tar"
}

# ...

To simplify, we assume Docker Hub (docker.io) and ignore other aspects such as tags completely.

If we now run ./toydocker pull redis and then ls /var/lib/toydocker from inside the VM, we'll see a redis.tar. Cool! We successfully pulled the OCI image archive for Redis.
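
Ignoring tags has a practical edge: the image name is used verbatim as a file name, so a reference like example/redis:7.2 would produce a path containing a slash and a colon. One possible way to relax the simplification is sketched below. The archive_name helper is hypothetical and not part of the toydocker script in this post; if you adopt it, toydocker run would need to use the same helper to find the archive.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Turn an image reference into a flat, file-system-friendly archive name.
archive_name() {
	local ref="$1"
	# Default to the "latest" tag when the last path component has no ":",
	# mirroring what docker pull does for untagged references
	if [[ "${ref##*/}" != *:* ]]; then
		ref="$ref:latest"
	fi
	# Replace characters that are awkward in file names
	ref="${ref//\//_}"
	ref="${ref//:/_}"
	echo "$ref.tar"
}

archive_name redis             # prints "redis_latest.tar"
archive_name example/redis:7.2 # prints "example_redis_7.2.tar"
```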

Implementing `toydocker run`

So far, we were able to build or pull an image archive. Now let's actually run it!

This is the content of toydocker for now:

#!/usr/bin/env bash

set -euo pipefail

STORAGE_PATH="/var/lib/toydocker"

function build {
	local name="${1:-}"
	if [[ -z "$name" ]]; then
		echo "Usage: $0 build <image_name>" >&2
		exit 1
	fi

	limactl shell ubuntu -- sudo mkdir -p "$STORAGE_PATH"

	# Copy the Dockerfile to the Lima VM
	limactl copy Dockerfile ubuntu:/tmp/Dockerfile
	# Use buildah to build the image inside the Lima VM
	limactl shell ubuntu -- sudo buildah bud --storage-driver vfs -t "$name" -f /tmp/Dockerfile .
	# Save the image as an OCI archive
	limactl shell ubuntu -- sudo buildah push --storage-driver vfs "$name" "oci-archive:$STORAGE_PATH/$name.tar"
}

function run {
	local name="${1:-}"
	if [[ -z "$name" ]]; then
		echo "Usage: $0 run <image_name>" >&2
		exit 1
	fi

	echo "TODO: Implement 'toydocker run'" >&2
	exit 1
}

function pull {
	local name="${1:-}"
	if [[ -z "$name" ]]; then
		echo "Usage: $0 pull <image_name>" >&2
		exit 1
	fi

	# Save the OCI archive using skopeo
	limactl shell ubuntu -- sudo skopeo copy "docker://$name" "oci-archive:$STORAGE_PATH/$name.tar"
}

function main {
	local cmd="${1:-}"
	shift || true

	case "$cmd" in
		build)
			build "$@"
			;;
		run)
			run "$@"
			;;
		pull)
			pull "$@"
			;;
		*)
			echo "Usage: $0 <build|run|pull>"
			exit 1
			;;
	esac
}

main "$@"

Running a "dumb" container

To simplify, we'll run a container without all the crucial parts that make a container practical (namely, isolation).

As we established before, a container is a Linux process.

Let's make toydocker execute a script (that we'll soon write) inside the Lima VM:


# ...

function run {
	local name="${1:-}"
	if [[ -z "$name" ]]; then
		echo "Usage: $0 run <image_name> [command...]" >&2
		exit 1
	fi
	
	shift # Remove the image name from the arguments, since we pass "$@" later

	limactl copy toydocker-run.sh ubuntu:/tmp/toydocker-run.sh
	limactl shell --tty=true ubuntu -- sudo bash /tmp/toydocker-run.sh "$STORAGE_PATH/$name.tar" "$@"
}

# ...

Now for the toydocker-run.sh script, let's think about what we want to do:

1. Reconstruct the filesystem of the container (the manifest lists the layers, which should end up on top of one another in order)
2. Spawn a new process such that:
   a. The command to run is derived from the config blob
   b. Environment variables and working directory are set according to the config blob
   c. The root FS is the one from step 1
   d. It is isolated (the next step of this walkthrough, after we are able to run a basic "dumb" container)

Let's start with step 1, by building out the root FS according to the layers, in order.

First, let's build out the boilerplate for the toydocker-run.sh script:

#!/usr/bin/env bash

set -euo pipefail

archive="${1:-}"
shift || true

# Create a temporary directory for the new container
workdir="$(mktemp -d /tmp/toydocker-run.XXXXXXXXXX)"

rootfs="$workdir/rootfs"
bundle="$workdir/bundle"
mkdir -p "$rootfs" "$bundle"

# Clean up after this process exits, which should also stop the container since we are doing "interactive" mode
# and not running the container detached.
cleanup() {
	rm -rf "$workdir"
}
trap cleanup EXIT

# Extract the archive into a "bundle" folder
tar -xf "$archive" -C "$bundle"

# Extract the manifest from the index
manifest_digest="$(jq -r '.manifests[0].digest' "$bundle/index.json")"
manifest="$bundle/blobs/${manifest_digest/://}"
# Extract the config from the manifest
config_digest="$(jq -r '.config.digest' "$manifest")"
config="$bundle/blobs/${config_digest/://}"
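
The ${manifest_digest/://} expansion used above can look cryptic at first. It is plain bash pattern substitution that turns a digest such as sha256:<hex> into the relative path sha256/<hex>. A short standalone illustration, using the manifest digest from our index.json:

```shell
#!/usr/bin/env bash
set -euo pipefail

digest="sha256:440d82fb2b7b1716661d2b2e90990c90001cfec0c2413eaf0a7af08f32bd1b43"

# ${var/pattern/replacement} replaces the first match of pattern.
# Here the pattern is ":" and the replacement is "/".
blob_path="blobs/${digest/://}"
echo "$blob_path"

# The same result, spelled out by splitting on the first ":"
algorithm="${digest%%:*}"
hex="${digest#*:}"
echo "blobs/$algorithm/$hex"
```

Both echo lines print the same path, which matches the blobs/sha256/<digest> layout described earlier.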

Great, now let's proceed to building out the root FS:

# Store the layers into the `layers` variable (`mapfile` is a useful command, a for-loop could work as well)
mapfile -t layers < <(jq -r '.layers[].digest' "$manifest")

# Extract the layers into the temporary root FS folder
for layer in "${layers[@]}"; do
	tar -xf "$bundle/blobs/${layer/://}" -C "$rootfs"
done

A major thing we skip over here is OCI whiteouts. They sound complex, but they are simply markers for the deletion of files/directories. More on that a bit later; for now, just be aware that our approach works for the demo image but isn't good enough for arbitrary OCI images, because layer deletions require whiteout handling.
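
For the curious, here is a rough sketch of what whiteout handling could look like. In the OCI layer format, an empty file named .wh.<name> means "delete <name> from the lower layers", and a file named .wh..wh..opq marks its directory as opaque (everything the lower layers put there is hidden). The apply_layer function is my own illustration under those rules, demonstrated on two made-up layers; it assumes a tar that supports --exclude (GNU tar and bsdtar do) and it is not hardened against malicious archives.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Apply one layer tarball on top of an existing rootfs, honoring whiteouts.
apply_layer() {
	local rootfs="$1" layer="$2" entry dir base
	# Pass 1: apply the deletion markers to what lower layers left behind
	while IFS= read -r entry; do
		entry="${entry#./}"
		entry="${entry%/}"
		[[ -n "$entry" ]] || continue
		base="${entry##*/}"
		dir="."
		if [[ "$entry" == */* ]]; then
			dir="${entry%/*}"
		fi
		case "$base" in
			.wh..wh..opq)
				# Opaque marker: hide everything already in this directory
				if [[ -d "$rootfs/$dir" ]]; then
					find "$rootfs/$dir" -mindepth 1 -delete
				fi
				;;
			.wh.*)
				# Regular whiteout: delete the named sibling
				rm -rf "${rootfs:?}/$dir/${base#.wh.}"
				;;
		esac
	done < <(tar -tf "$layer")
	# Pass 2: extract the layer itself, skipping the marker files
	tar -xf "$layer" -C "$rootfs" --exclude='.wh.*'
}

# Demo: layer 1 adds a.txt and b.txt, layer 2 deletes a.txt and adds c.txt
work="$(mktemp -d)"
mkdir -p "$work/l1" "$work/l2" "$work/rootfs"
echo "old" > "$work/l1/a.txt"
echo "keep" > "$work/l1/b.txt"
touch "$work/l2/.wh.a.txt"
echo "new" > "$work/l2/c.txt"
tar -cf "$work/l1.tar" -C "$work/l1" .
tar -cf "$work/l2.tar" -C "$work/l2" .

apply_layer "$work/rootfs" "$work/l1.tar"
apply_layer "$work/rootfs" "$work/l2.tar"
ls -A "$work/rootfs"
```

The markers are processed before extraction so that files the same layer adds inside an opaque directory survive.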

Then we can extract the command from the config, or override it if the user supplied an override (similarly to docker run). Images can define both Entrypoint and Cmd, so we need to combine them:

# Extract the entrypoint and command arrays from the config
mapfile -t entrypoint < <(jq -r '.config.Entrypoint[]?' "$config")
mapfile -t image_cmd < <(jq -r '.config.Cmd[]?' "$config")

if [[ $# -eq 0 ]]; then
	cmd=("${entrypoint[@]}" "${image_cmd[@]}")
elif [[ ${#entrypoint[@]} -gt 0 ]]; then
	cmd=("${entrypoint[@]}" "$@")
else
	cmd=("$@")
fi
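
To make the three branches concrete, here is the same logic wrapped in a function and exercised with a few inputs. The function name resolve_cmd and the /entrypoint.sh value are illustrative only and are not taken from any real image.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Combine an image's Entrypoint and Cmd with optional user arguments.
# Reads the global arrays `entrypoint` and `image_cmd`, writes `cmd`.
resolve_cmd() {
	if [[ $# -eq 0 ]]; then
		cmd=("${entrypoint[@]}" "${image_cmd[@]}")
	elif [[ ${#entrypoint[@]} -gt 0 ]]; then
		cmd=("${entrypoint[@]}" "$@")
	else
		cmd=("$@")
	fi
}

# Case 1: image with only Cmd, no user arguments: Cmd runs as-is
entrypoint=()
image_cmd=(cat /hello.txt)
resolve_cmd
echo "${cmd[*]}" # prints "cat /hello.txt"

# Case 2: same image, user arguments replace Cmd entirely
resolve_cmd ls /
echo "${cmd[*]}" # prints "ls /"

# Case 3: image with an Entrypoint: user arguments replace Cmd
# but are appended to the Entrypoint
entrypoint=(/entrypoint.sh)
image_cmd=(redis-server)
resolve_cmd redis-server --ignore-warnings ARM64-COW-BUG
echo "${cmd[*]}" # prints "/entrypoint.sh redis-server --ignore-warnings ARM64-COW-BUG"
```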

Then we can extract the working directory and environment variables from the config too:

# Extract the working directory from the config, defaulting to "/" if not set
image_workdir="$(jq -r '.config.WorkingDir // "/"' "$config")"
# Extract the environment variables from the config into the `env` variable, which should be an array of "KEY=VALUE" strings
mapfile -t env < <(jq -r '.config.Env[]?' "$config")

And now for the most important part - spawning the process while setting its root FS to the one we created earlier (using Linux's chroot):

env -i \
	HOME=/root \
	PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin \
	"${env[@]}" \
	chroot "$rootfs" /bin/sh -c 'cd "$1" && shift && exec "$@"' sh "$image_workdir" "${cmd[@]}"

Container experts will be mad at me for this since pivot_root should be recommended over chroot, please forgive!

Further reading: What's the difference between chroot and pivot_root? (stackoverflow.com/questions/68667003/whats-the-difference-between-chroot-and-pivot-root)
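
One detail of the env -i invocation above is worth spelling out: it starts from an empty environment, so nothing from the host shell leaks into the container process, and the assignments are applied in order, so a PATH coming from the image config overrides the default one we pass first. A small standalone demonstration with made-up variables (image_env, GREETING and HOST_ONLY_SECRET are placeholders of my own):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Pretend these came from the image config's Env array
image_env=("GREETING=hello" "PATH=/usr/local/bin:/usr/bin:/bin")

# A variable that exists only on the "host" side
export HOST_ONLY_SECRET="do-not-leak"

# `env -i` starts from an empty environment, then applies the assignments
# in order, so later entries win over earlier ones
seen="$(env -i PATH=/bin "${image_env[@]}" /bin/sh -c 'echo "$GREETING|${HOST_ONLY_SECRET:-unset}|$PATH"')"
echo "$seen" # prints "hello|unset|/usr/local/bin:/usr/bin:/bin"
```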

Putting all this together we get this for toydocker-run.sh:

#!/usr/bin/env bash

set -euo pipefail

archive="${1:-}"
shift || true

# Create a temporary directory for the new container
workdir="$(mktemp -d /tmp/toydocker-run.XXXXXXXXXX)"

rootfs="$workdir/rootfs"
bundle="$workdir/bundle"
mkdir -p "$rootfs" "$bundle"

# Clean up after this process exits, which should also stop the container since we are doing "interactive" mode
# and not running the container detached.
cleanup() {
	rm -rf "$workdir"
}
trap cleanup EXIT

# Extract the archive into a "bundle" folder
tar -xf "$archive" -C "$bundle"

# Extract the manifest from the index
manifest_digest="$(jq -r '.manifests[0].digest' "$bundle/index.json")"
manifest="$bundle/blobs/${manifest_digest/://}"
# Extract the config from the manifest
config_digest="$(jq -r '.config.digest' "$manifest")"
config="$bundle/blobs/${config_digest/://}"

# Store the layers into the `layers` variable (`mapfile` is a useful command, a for-loop could work as well)
mapfile -t layers < <(jq -r '.layers[].digest' "$manifest")

# Extract the layers into the temporary root FS folder
for layer in "${layers[@]}"; do
	tar -xf "$bundle/blobs/${layer/://}" -C "$rootfs"
done

# Extract the entrypoint and command arrays from the config
mapfile -t entrypoint < <(jq -r '.config.Entrypoint[]?' "$config")
mapfile -t image_cmd < <(jq -r '.config.Cmd[]?' "$config")

if [[ $# -eq 0 ]]; then
	cmd=("${entrypoint[@]}" "${image_cmd[@]}")
elif [[ ${#entrypoint[@]} -gt 0 ]]; then
	cmd=("${entrypoint[@]}" "$@")
else
	cmd=("$@")
fi

# Extract the working directory from the config, defaulting to "/" if not set
image_workdir="$(jq -r '.config.WorkingDir // "/"' "$config")"
# Extract the environment variables from the config into the `env` variable, which should be an array of "KEY=VALUE" strings
mapfile -t env < <(jq -r '.config.Env[]?' "$config")

env -i \
	HOME=/root \
	PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin \
	"${env[@]}" \
	chroot "$rootfs" /bin/sh -c 'cd "$1" && shift && exec "$@"' sh "$image_workdir" "${cmd[@]}"

Let's try it out!

$ ./toydocker run redis
254930:C 22 May 2026 15:55:09.385 # Failed to test the kernel for a bug that could lead to data corruption during background save. Your system could be affected, please report this error.
254930:C 22 May 2026 15:55:09.385 # Redis will now exit to prevent data corruption. Note that it is possible to suppress this warning by setting the following config: ignore-warnings ARM64-COW-BUG

Oh no! Does Redis have a known bug? Is this an environment/kernel issue?

Actually, we skipped a step that is critical to running a "real" container. Let's ignore this for now; we'll see how to fix it in the next section. In the meantime, we can override the command and run redis-server --ignore-warnings ARM64-COW-BUG:

$ ./toydocker run redis redis-server --ignore-warnings ARM64-COW-BUG
254979:C 22 May 2026 15:56:43.972 # Failed to test the kernel for a bug that could lead to data corruption during background save. Your system could be affected, please report this error.
254979:C 22 May 2026 15:56:43.972 * oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
254979:C 22 May 2026 15:56:43.972 * Redis version=8.6.3, bits=64, commit=00000000, modified=1, pid=254979, just started
...
254979:M 22 May 2026 15:56:43.972 * monotonic clock: ARM CNTVCT @ 24 ticks/us
                _._
           _.-``__ ''-._
      _.-``    `.  `_.  ''-._           Redis Open Source
  .-`` .-```.  ```\/    _.,_ ''-._      8.6.3 (00000000/1) 64 bit
 (    '      ,       .-`  | `,    )     Running in standalone mode
 |`-._`-...-` __...-.``-._|'` _.-'|     Port: 6379
 |    `-._   `._    /     _.-'    |     PID: 254979
  `-._    `-._  `-./  _.-'    _.-'
 |`-._`-._    `-.__.-'    _.-'_.-'|
 |    `-._`-._        _.-'_.-'    |           https://redis.io
  `-._    `-._`-.__.-'_.-'    _.-'
 |`-._`-._    `-.__.-'    _.-'_.-'|
 |    `-._`-._        _.-'_.-'    |
  `-._    `-._`-.__.-'_.-'    _.-'
      `-._    `-.__.-'    _.-'
          `-._        _.-'
              `-.__.-'

254979:M 22 May 2026 15:56:43.973 * Server initialized
254979:M 22 May 2026 15:56:43.973 * Ready to accept connections tcp
...

Awesome! We are running a Redis container on our own, how awesome is that?

We can check that Redis is listening on its port inside the Lima VM:

$ limactl shell ubuntu
$ sudo lsof -nP -iTCP -sTCP:LISTEN
COMMAND      PID            USER  FD   TYPE DEVICE SIZE/OFF NODE NAME
...
redis-ser 254979            root   8u  IPv4 588898      0t0  TCP *:6379 (LISTEN)
redis-ser 254979            root   9u  IPv6 588899      0t0  TCP *:6379 (LISTEN)

Cool, Redis is actually actively listening for incoming TCP connections on port 6379!
This is a good point to mention that Lima forwards ports between the guest (Lima VM) and the host (your Mac), so we can try talking to Redis from the host on localhost.

Let's give it a shot from the host:

$ redis-cli set foo bar
OK
$ redis-cli --raw get foo
bar

Cool! And if we Ctrl+C in the terminal where we ran the toydocker run command and try this again, we get:

$ redis-cli
Could not connect to Redis at 127.0.0.1:6379: Connection refused

As expected! We are off to an awesome start.

Now let's make the container closer to what actually happens with real container runtimes, and revisit the error we saw before with Redis and why it happens.

Theory of running a "real" container

Process isolation

So far, we were able to successfully run a new process that has its own root FS, and it seemed to work.
A mental model that works for me is to imagine every such process (AKA container) as its own "little OS" (or more accurately, its own "view" of the OS: process tree, filesystem root, hostname, networking stack, users, resource limits, etc.).

However, there were a few things wrong with what we did:

Inefficient file system operations - we copied the layers over each other, which takes up a lot of space and may also take some time to do
No isolation - the process we spawned shares a lot of things with its host (able to signal/communicate with host processes, shared IPC state, etc.). This can be catastrophic in terms of security (container escape), availability (fork bomb inside a rogue container), and more

Remember those Linux concepts I mentioned earlier that we brushed over? Let's revisit those now.

Linux namespaces

Namespaces allow you to isolate what processes can see.

You can check which namespaces a process belongs to using lsns or readlink /proc/{PID}/ns/*, e.g.:

$ limactl shell ubuntu bash -c 'readlink /proc/$$/ns/*'
cgroup:[4026531835]
ipc:[4026531839]
mnt:[4026531832]
net:[4026531833]
pid:[4026531836]
pid:[4026531836]
time:[4026531834]
time:[4026531834]
user:[4026531837]
uts:[4026531838]

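Each namespace has an ID, and processes sharing the same namespace see the same "world" for that specific namespace type.
The pid and time entries appear twice because /proc/{PID}/ns also contains the pid_for_children and time_for_children links, which point to the same namespaces here.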
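
To make the "same ID means same world" idea concrete, here is a small sketch that compares the namespace links of two processes. The ns_id and same_ns function names and the PROC_ROOT variable are my own assumptions for illustration (PROC_ROOT only exists so the sketch can be pointed at a different directory); on a real system it simply reads /proc:

```shell
# PROC_ROOT is a hypothetical override; on a real system it stays at /proc.
PROC_ROOT="${PROC_ROOT:-/proc}"

# Print the namespace identifier (e.g. "pid:[4026531836]") of a process.
ns_id() {
    readlink "$PROC_ROOT/$1/ns/$2"
}

# Succeed when two processes are in the same namespace of the given type.
# Returns 2 when either link cannot be read (e.g. missing permissions).
same_ns() {
    a="$(ns_id "$1" "$3")" || return 2
    b="$(ns_id "$2" "$3")" || return 2
    [ "$a" = "$b" ]
}

# Example: does this shell share a PID namespace with its parent process?
if same_ns "$$" "$PPID" pid 2>/dev/null; then
    echo "same PID namespace as the parent process"
else
    echo "different PID namespace (or links not readable)"
fi
```

Reading the links of processes owned by other users typically needs root, which is why the example sticks to the shell and its parent.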

You can create a new namespace and execute a program within it using the unshare command (so called because it "unshares" things with the process which called it, using namespace semantics).

Let's go over a few important namespaces.

PID namespace

Isolates process IDs:

$ limactl shell ubuntu bash -c 'ps -ef'
UID          PID    PPID  C STIME TTY          TIME CMD
root           1       0  0 May22 ?        00:00:05 /usr/lib/systemd/systemd --system --deserialize=81
root           2       0  0 May22 ?        00:00:00 [kthreadd]
root           3       2  0 May22 ?        00:00:00 [pool_workqueue_release]
root           4       2  0 May22 ?        00:00:00 [kworker/R-rcu_gp]
root           5       2  0 May22 ?        00:00:00 [kworker/R-sync_wq]
... (omitted) ...

$ limactl shell ubuntu bash -c 'sudo unshare --pid --fork --mount-proc ps aux'
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.0  10016  3440 pts/3    R+   01:11   0:00 ps aux

--pid creates a new PID namespace. --fork makes unshare fork the program as a child process instead of executing it directly, so the program becomes PID 1 in the new namespace. --mount-proc is a bit trickier: it mounts a fresh /proc (which also implies a new mount namespace); otherwise ps would read the host's /proc and still see the host's process info.

[!IMPORTANT]

This is the kind of missing runtime setup that can break real programs in surprising ways. Our earlier chroot root filesystem had no /proc mount, so programs that inspect kernel/process state through /proc may fail or misdiagnose the environment. Real runtimes mount /proc as part of container setup, and our later version will do the same.

I suspect this is why Redis refused to start earlier without ignoring the ARM COW bug warning. I have not looked into it deeply enough and may be wrong, but my theory is that the warning was a false positive: Redis could not read the files it needs under /proc, so it could not test for the bug.

Mount namespace

Isolates mount points:

$ limactl shell ubuntu bash -c 'sudo unshare --mount bash -c "mount -t tmpfs tmpfs /mnt && touch /mnt/isolated && ls /mnt"'
isolated

$ limactl shell ubuntu bash -c 'ls /mnt'
lima-cidata

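The isolated file is not visible from the host, because the tmpfs mount exists only inside the new mount namespace.

Usually you also run mount --make-rprivate / inside the namespace so that its mounts don't propagate to the host (unshare already sets the propagation to private by default).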

UTS namespace

Isolates the hostname. Seems a bit silly, but important (consider a container changing the hostname of its host, thus breaking telemetry, identifier resolutions by apps, etc.).

$ limactl shell ubuntu hostname
lima-ubuntu

$ limactl shell ubuntu bash -c 'sudo unshare --uts bash -c "hostname my-new-hostname && hostname"'
my-new-hostname

$ limactl shell ubuntu hostname
lima-ubuntu

Network namespace

Isolates network interfaces, routes, iptables rules, etc.

$ limactl shell ubuntu ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
    link/ether 52:55:55:df:78:8b brd ff:ff:ff:ff:ff:ff
    altname enx52111143788
    
$ limactl shell ubuntu bash -c 'sudo unshare --net ip link'
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

$ limactl shell ubuntu ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
    link/ether 52:55:55:df:78:8b brd ff:ff:ff:ff:ff:ff
    altname enx52111143788

Note that the lo network interface (loopback) is down by default. You would want to run ip link set lo up and also bridge the host<>container networks (we'll do that later on when we set up the container properly).

Network namespaces are a bit more complex, so let's dive a bit deeper.

Essentially, for the container to be able to expose ports outside of its network namespace, we want a few things to happen:

Set up a network interface within the container namespace (veth-toy)
Set up its peer interface on the host (veth-host) and attach it to a bridge
Handle traffic forwarding (can be done via host iptables/nftables, but it is easier to use socat for port forwarding in this already long exercise)


# Set up a new net NS named `toyns` (you can list all with `ip netns list`)
$ limactl shell ubuntu sudo ip netns add toyns
# Create a virtual ethernet pair: `veth-host` will stay in the host namespace and `veth-toy` will soon move into the container namespace
$ limactl shell ubuntu sudo ip link add veth-host type veth peer name veth-toy
# Move the `veth-toy` vNIC into the `toyns` namespace
$ limactl shell ubuntu sudo ip link set veth-toy netns toyns
# Create a Linux bridge named `toy0` (acts like a virtual network switch)
$ limactl shell ubuntu sudo ip link add toy0 type bridge
# Assign `10.10.0.1/24` to the bridge: `toy0` gets the address `10.10.0.1` within the subnet `10.10.0.0/24`
$ limactl shell ubuntu sudo ip addr add 10.10.0.1/24 dev toy0
# Bring `toy0` up (note that `ip link show` will still show it as "DOWN" until both ends of the veth pair attached to it are up)
$ limactl shell ubuntu sudo ip link set toy0 up
# Attach `veth-host` to the `toy0` bridge
$ limactl shell ubuntu sudo ip link set veth-host master toy0
# Bring the host-side veth interface up
$ limactl shell ubuntu sudo ip link set veth-host up
# Inside the `toyns` namespace:
# Assign IP 10.10.0.2/24 to veth-toy
$ limactl shell ubuntu sudo ip netns exec toyns ip addr add 10.10.0.2/24 dev veth-toy
# Bring the namespace-side interface up (this will finally show `toy0` as up if you run `ip link show`)
$ limactl shell ubuntu sudo ip netns exec toyns ip link set veth-toy up
# Bring the loopback interface up inside the namespace (many applications assume lo is usable)
$ limactl shell ubuntu sudo ip netns exec toyns ip link set lo up
# Add a default route inside the namespace
# All unknown traffic goes through the bridge gateway
$ limactl shell ubuntu sudo ip netns exec toyns ip route add default via 10.10.0.1

From what I could gather, real container runtimes create anonymous network namespaces, and not named ones like our toyns one.

Cool, now let's demonstrate that it works.

# Terminal 1
$ limactl shell ubuntu sudo ip netns exec toyns python3 -m http.server 8080 --bind 0.0.0.0

# Terminal 2
$ limactl shell ubuntu curl http://10.10.0.2:8080
<!DOCTYPE HTML>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Directory listing for /</title>
</head>
<body>
<h1>Directory listing for /</h1>
<hr>
<ul>
<li><a href="Dockerfile">Dockerfile</a></li>
<li><a href="init-archive.sh">init-archive.sh</a></li>
... (omitted) ...

Awesome!

Now let's try reaching it from our actual host (the Mac):

$ curl http://localhost:8080
curl: (7) Failed to connect to localhost port 8080 after 0 ms: Couldn't connect to server

Oh no! What happened?

This was actually expected: nothing automatically forwards traffic from macOS into this custom namespace/service.

In real world scenarios, the Docker Desktop setup is fairly complex - see How Docker Desktop Networking Works Under the Hood for a good overview.

For simplicity, we'll use SSH tunnels. Let's set it up in a bit.
Before that, let's tackle another important problem - web servers often expose common ports (3000, 80, 443, etc.) but we can't reuse the same ports for potentially hundreds of containers running on the host.

So we could randomly select an open port, and then map it to the port inside the namespace.
How can we do that? We could use iptables, but for the sake of simplicity of this already long post, let's simply leverage socat to relay TCP connections.

$ limactl shell ubuntu sudo nohup socat -d -d TCP4-LISTEN:18080,fork,bind=127.0.0.1,reuseaddr TCP4:10.10.0.2:8080 &

This runs socat in the background and forwards TCP connections from localhost:18080 inside the Lima VM to 10.10.0.2:8080 (i.e., our container).

Then to make this accessible on the real host machine from port 8080, we could use SSH tunnels for simplicity as mentioned before.

$ ssh -F ~/.lima/ubuntu/ssh.config -N -L 8080:127.0.0.1:18080 lima-ubuntu

# Then, in a separate terminal:
$ curl localhost:8080
<!DOCTYPE HTML>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Directory listing for /</title>
</head>
<body>
<h1>Directory listing for /</h1>
<hr>
<ul>
<li><a href="Dockerfile">Dockerfile</a></li>
<li><a href="init-archive.sh">init-archive.sh</a></li>
... (omitted) ...

Awesome!

IPC namespace

Isolates IPC (Inter-Process Communication) - things like shared memory, semaphores, message queues, etc.

$ limactl shell ubuntu bash -c 'ipcmk --queue && ipcs'

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages
0x1cdd8644 0          guywald    644        0            0

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status

------ Semaphore Arrays --------
key        semid      owner      perms      nsems

$ limactl shell ubuntu bash -c 'sudo unshare --ipc ipcs'

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status

------ Semaphore Arrays --------
key        semid      owner      perms      nsems

User namespace

Maps users inside the namespace to different users outside. A useful example is having a root user inside the container which maps to a non-root user outside of it.

$ limactl shell ubuntu id
uid=501(guywald) gid=1000(guywald) groups=1000(guywald)

$ limactl shell ubuntu sudo unshare --user --map-root-user bash -c 'id && cat /proc/self/uid_map'
uid=0(root) gid=0(root) groups=0(root)
         0        0          1

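Inside the namespace, the process sees itself as UID 0.
The uid_map line reads "inside-start outside-start count". Here it is 0 0 1 because we ran unshare through sudo, so UID 0 inside maps to UID 0 outside. Run by a regular user without sudo (where unprivileged user namespaces are permitted), the map would point at that user's UID instead, which gives the root-inside, non-root-outside setup described above.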
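
To get a feel for how uid_map lines are interpreted, here is a minimal sketch that translates a UID inside a user namespace to the UID outside of it. The map_uid function name is hypothetical, and the sample mappings (0 501 1 and 0 100000 65536) are illustrative values rather than output from the session above:

```shell
# Translate a UID inside a user namespace to the UID outside of it, given
# uid_map-style lines ("<inside-start> <outside-start> <count>") on stdin.
map_uid() {
    target="$1"
    while read -r inside outside count; do
        if [ "$target" -ge "$inside" ] && [ "$target" -lt $((inside + count)) ]; then
            echo $((outside + target - inside))
            return 0
        fi
    done
    return 1   # the UID is not covered by any mapping line
}

# Root inside maps to UID 501 outside (a single-UID mapping).
echo "0 501 1" | map_uid 0

# A range mapping: UID 1000 inside maps to 101000 outside.
echo "0 100000 65536" | map_uid 1000
```

On a live system you could feed it the real mapping with map_uid 0 < /proc/self/uid_map from inside the namespace.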

cgroups

If namespaces control what a process can see, cgroups (control groups) control what a process can use.
This covers resources such as CPU, memory, disk I/O, the number of processes, and more.

With cgroups v2, each cgroup is a directory under /sys/fs/cgroup, containing files for the different resource controls.
For example:

$ # Create a new cgroup called "demo"
$ sudo mkdir /sys/fs/cgroup/demo

$ # Cap memory usage to 100 MiB (the redirect must run as root too, hence `sudo tee`)
$ echo $((100 * 1024 * 1024)) | sudo tee /sys/fs/cgroup/demo/memory.max

$ # Can use at most 50ms of CPU time every 100ms (half of a CPU core)
$ echo "50000 100000" | sudo tee /sys/fs/cgroup/demo/cpu.max
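
The two numbers in cpu.max are a quota and a period, both in microseconds, and memory.max takes a plain byte count. Here is a small sketch of that arithmetic. The cpu_max and memory_max function names are hypothetical, and expressing the CPU allowance in millicores (1000 = one full core) is an assumption I borrowed for illustration, not something cgroups define:

```shell
# Build a cpu.max value ("<quota> <period>", both in microseconds) from a CPU
# allowance in millicores (1000 = one full core). The 100ms period matches
# the example above.
cpu_max() {
    millicores="$1"
    period_us=100000
    echo "$((millicores * period_us / 1000)) $period_us"
}

# Convert MiB to the byte count that memory.max expects.
memory_max() {
    echo $(($1 * 1024 * 1024))
}

cpu_max 500      # half a core
memory_max 100   # 100 MiB
```

The output of these functions is what you would pipe into sudo tee for the corresponding cgroup file.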

OverlayFS

This is actually a really cool one and core to how containers work (and why they are so lightweight).

As we saw before, containers are made of layers which should be constructed in order.
What we did before, naively, was to simply copy the layers over each other.

This raises a few challenges, as we mentioned before, such as the high disk/storage overhead (especially when you want to share layers between containers), time of copying all the files each time, and more.

The proper way to go about this is to overlay the layers on top of each other, setting some of them to be read-only, and leveraging copy-on-write semantics.
This makes container file systems have virtually no overhead - starting them leverages existing layers already on the filesystem, and when they create or modify files, only the changes are reflected.

We can use mount -t overlay to overlay two directories on top of each other, where the bottom layer is automatically treated as read-only by the kernel.
If an operation modifies a file in a lower layer, OverlayFS intercepts the request, copies the file up to the upper directory (writable layer) and applies the change there.

Let's try this out using mount -t overlay inside the Lima VM:

$ limactl shell ubuntu
$ mkdir -p /tmp/overlay/{lower,upper,work,merged}
$ echo "from lower" > /tmp/overlay/lower/file.txt
$ sudo mount -t overlay overlay -o lowerdir=/tmp/overlay/lower,upperdir=/tmp/overlay/upper,workdir=/tmp/overlay/work /tmp/overlay/merged

$ # Read the file from the merged layer
$ cat /tmp/overlay/merged/file.txt
from lower

$ # Modify the file in the merged layer
$ echo changed > /tmp/overlay/merged/file.txt

$ cat /tmp/overlay/lower/file.txt
from lower

# ^ File was unchanged in the lower layer

$ cat /tmp/overlay/upper/file.txt
changed

# ^ File was changed only in the upper (writable) layer - this is copy-on-write semantics

Capabilities & Seccomp

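Linux splits the privileges traditionally associated with root into distinct units called capabilities, so a process can run with only a reduced set of them. For example, CAP_NET_ADMIN allows configuring NICs, routes, iptables, etc., and CAP_KILL allows signaling processes regardless of who owns them.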

This is a security feature that allows running containers with a reduced set of permissions.
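
Under the hood, a capability set is a bitmask, which the kernel exposes as hexadecimal fields such as CapBnd (the bounding set) in /proc/{PID}/status. Here is a minimal sketch for checking a single capability in such a mask. The has_cap and bounding_set function names are hypothetical; the capability numbers come from the kernel's linux/capability.h header:

```shell
# Capability numbers from <linux/capability.h>.
CAP_KILL=5
CAP_NET_ADMIN=12
CAP_SYS_ADMIN=21

# Succeed when capability number $2 is set in the hexadecimal mask $1.
has_cap() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# Print the bounding set of the current process (Linux only).
bounding_set() {
    awk '/^CapBnd:/ { print $2 }' /proc/self/status
}

mask="$(bounding_set 2>/dev/null || true)"
if [ -n "$mask" ]; then
    if has_cap "$mask" "$CAP_SYS_ADMIN"; then
        echo "CAP_SYS_ADMIN is in the bounding set"
    else
        echo "CAP_SYS_ADMIN is not in the bounding set"
    fi
fi
```

Running this inside and outside a container is a quick way to see which capabilities a runtime removed.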

In addition, the Linux kernel has a security feature called seccomp that filters syscalls: a syscall firewall, if you will.

Let's compile and run this C program (call it demo.c) inside Lima:

#include <seccomp.h>
#include <unistd.h>
#include <stdio.h>
#include <errno.h>
#include <sys/syscall.h>

int main() {
    // Load a seccomp profile that allows everything by default.
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ALLOW);
    seccomp_load(ctx);

    printf("getpid = %ld\n", syscall(SYS_getpid));

    return 0;
}

$ limactl shell ubuntu bash -c 'sudo apt-get install -y gcc libseccomp-dev'

$ limactl cp demo.c ubuntu:/tmp/demo.c

$ limactl shell --workdir /tmp ubuntu bash -c 'gcc demo.c -lseccomp && ./a.out'
getpid = 272916

Now let's disallow getpid and also print the error. We'll call getpid through syscall directly, rather than going through libc, so the result is easier to reason about:

#include <seccomp.h>
#include <unistd.h>
#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <sys/syscall.h>

int main() {
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ALLOW);
    seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM), SCMP_SYS(getpid), 0);
    seccomp_load(ctx);

    errno = 0;
    long pid = syscall(SYS_getpid);

    if (pid == -1) {
        printf("getpid failed: %s\n", strerror(errno));
    } else {
        printf("getpid = %ld\n", pid);
    }

    return 0;
}

Let's try running it again:

$ limactl cp demo.c ubuntu:/tmp/demo.c

$ limactl shell --workdir /tmp ubuntu bash -c 'gcc demo.c -lseccomp && ./a.out'
getpid failed: Operation not permitted

It now fails, due to getpid being disallowed by seccomp.
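
You can also check from the outside whether a process has a seccomp filter installed: the kernel reports it in the Seccomp field of /proc/{PID}/status (0 for disabled, 1 for strict mode, 2 for filter mode). Here is a small sketch that decodes that field. The seccomp_mode function name is hypothetical, and it accepts the status file as an argument so it can be pointed at any process:

```shell
# Report the seccomp mode recorded in a /proc/<pid>/status file.
# Defaults to the status of the calling process tree when no file is given.
seccomp_mode() {
    status_file="${1:-/proc/self/status}"
    mode="$(awk '/^Seccomp:/ { print $2 }' "$status_file" 2>/dev/null || true)"
    case "$mode" in
        0) echo disabled ;;
        1) echo strict ;;
        2) echo filter ;;
        *) echo unknown ;;
    esac
}

# Example: inspect the current shell.
seccomp_mode "/proc/$$/status"
```

A process started by Docker with its default profile would typically report filter here, while a plain shell on the host usually reports disabled.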

Running a "real" container

Let's put all the theory into practice.

Let's review the toydocker-run.sh script we had before:

#!/usr/bin/env bash

set -euo pipefail

archive="${1:-}"
shift || true

# Create a temporary directory for the new container
workdir="$(mktemp -d /tmp/toydocker-run.XXXXXXXXXX)"

rootfs="$workdir/rootfs"
bundle="$workdir/bundle"
mkdir -p "$rootfs" "$bundle"

# Clean up after this process exits, which should also stop the container since we are doing "interactive" mode
# and not running the container detached.
cleanup() {
	rm -rf "$workdir"
}
trap cleanup EXIT

# Extract the archive into a "bundle" folder
tar -xf "$archive" -C "$bundle"

# Extract the manifest from the index
manifest_digest="$(jq -r '.manifests[0].digest' "$bundle/index.json")"
manifest="$bundle/blobs/${manifest_digest/://}"
# Extract the config from the manifest
config_digest="$(jq -r '.config.digest' "$manifest")"
config="$bundle/blobs/${config_digest/://}"

# Store the layers into the `layers` variable (`mapfile` is a useful command, a for-loop could work as well)
mapfile -t layers < <(jq -r '.layers[].digest' "$manifest")

# Extract the layers into the temporary root FS folder
for layer in "${layers[@]}"; do
	tar -xf "$bundle/blobs/${layer/://}" -C "$rootfs"
done

# Extract the entrypoint and command arrays from the config
mapfile -t entrypoint < <(jq -r '.config.Entrypoint[]?' "$config")
mapfile -t image_cmd < <(jq -r '.config.Cmd[]?' "$config")

if [[ $# -eq 0 ]]; then
	cmd=("${entrypoint[@]}" "${image_cmd[@]}")
elif [[ ${#entrypoint[@]} -gt 0 ]]; then
	cmd=("${entrypoint[@]}" "$@")
else
	cmd=("$@")
fi

# Extract the working directory from the config, defaulting to "/" if not set
image_workdir="$(jq -r '.config.WorkingDir // "/"' "$config")"
# Extract the environment variables from the config into the `env` variable, which should be an array of "KEY=VALUE" strings
mapfile -t env < <(jq -r '.config.Env[]?' "$config")

env -i \
	HOME=/root \
	PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin \
	"${env[@]}" \
	chroot "$rootfs" /bin/sh -c 'cd "$1" && shift && exec "$@"' sh "$image_workdir" "${cmd[@]}"

Now let's adapt it such that we:

Run the process in its own dedicated namespaces for PID, mount, UTS, network, IPC and user
Place the process in a dedicated cgroup, where resource limits could be applied
Overlay the layers on top of each other using OverlayFS (and utilize cached layers)
Run with stricter capabilities

We'll ignore Seccomp for simplicity, but this should be a straightforward exercise for the reader.

First, let's tweak toydocker itself so it passes a persistent layer cache directory into the runner:

# ...

function run {
	local name="${1:-}"
	if [[ -z "$name" ]]; then
		echo "Usage: $0 run <image_name> [command...]" >&2
		exit 1
	fi
	
	shift

	limactl copy toydocker-run.sh ubuntu:/tmp/toydocker-run.sh
	limactl shell --tty=true ubuntu -- sudo bash /tmp/toydocker-run.sh "$STORAGE_PATH/$name.tar" "$STORAGE_PATH/layers" "$@"
}

# ...

Now the runner can extract each image layer exactly once into $STORAGE_PATH/layers, and every container run can reuse those extracted layer directories as read-only OverlayFS lower layers.

One important detail: the manifest lists layers from base to top.
OverlayFS' lowerdir option wants the most specific layer first, so we reverse the order when building the mount option.
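
Here is the reversal in isolation, as a minimal sketch. The build_lowerdir function name and the /cache/... paths are hypothetical; the function takes layer directories in manifest order (base first) and prints them top first, joined with colons:

```shell
# Join layer directories (given base-first, as in the manifest) into an
# OverlayFS lowerdir value (top layer first, ':'-separated).
build_lowerdir() {
    lowerdir=""
    for dir in "$@"; do
        # Prepend each directory, so the last (topmost) layer ends up first.
        lowerdir="$dir${lowerdir:+:$lowerdir}"
    done
    echo "$lowerdir"
}

build_lowerdir /cache/base /cache/middle /cache/top
```

Since the colon is the separator in lowerdir, the directory names themselves should not contain one, which is presumably a good reason to store cached layers under a key such as sha256_<hex> rather than the raw digest.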

As mentioned before, this implementation doesn't handle OCI whiteouts.
Real image layers can contain entries like .wh.some-file to represent deleting a file from a lower layer, or .wh..wh..opq to make a directory opaque.
A production runtime translates those markers into OverlayFS whiteouts, but we'll skip that for simplicity. If you're interested in learning more, Interpreting whiteout files in Docker image layers is a great explanation.
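
To give a rough idea of what that translation has to decide, here is a minimal sketch that classifies a file name from a layer according to the whiteout naming convention. The classify_entry function name and its output strings are hypothetical, and it only looks at names; it does not touch the filesystem:

```shell
# Classify the base name of a layer entry according to the OCI whiteout
# naming convention.
classify_entry() {
    name="$1"
    case "$name" in
        .wh..wh..opq) echo "opaque-dir" ;;              # hide everything below this directory
        .wh.*)        echo "delete ${name#.wh.}" ;;     # hide one file from lower layers
        *)            echo "keep $name" ;;              # regular entry
    esac
}

classify_entry .wh..wh..opq
classify_entry .wh.some-file
classify_entry app.conf
```

On the OverlayFS side, a deleted file is represented by a character device with device number 0/0 at that path, and an opaque directory by the trusted.overlay.opaque extended attribute, so a real runtime would create those instead of extracting the marker files as-is.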

For networking, we'll reuse the bridge and socat idea from before.
The runner will read the image's exposed TCP ports, forward each one from 127.0.0.1:1<container-port> inside Lima to the container IP, and print the SSH tunnel command you can run from macOS.

#!/usr/bin/env bash

set -euo pipefail

archive="${1:-}"
layer_cache="${2:-}"
shift 2 || true

if [[ -z "$archive" || -z "$layer_cache" ]]; then
	echo "Usage: $0 <oci_archive> <layer_cache_dir> [command...]" >&2
	exit 1
fi

container_id="$(date +%s%N)"
container_dir="$(mktemp -d /tmp/toydocker-run.XXXXXXXXXX)"

bundle="$container_dir/bundle"
upperdir="$container_dir/upper"
workdir="$container_dir/work"
rootfs="$container_dir/rootfs"
env_file="$container_dir/env"
cgroup="/sys/fs/cgroup/toydocker/$container_id"
suffix="${container_id: -6}"


mkdir -p "$bundle" "$upperdir" "$workdir" "$rootfs" "$layer_cache"

# Extract the archive into a "bundle" folder
tar -xf "$archive" -C "$bundle"

# Extract the manifest from the index
manifest_digest="$(jq -r '.manifests[0].digest' "$bundle/index.json")"
manifest="$bundle/blobs/${manifest_digest/://}"
# Extract the config from the manifest
config_digest="$(jq -r '.config.digest' "$manifest")"
config="$bundle/blobs/${config_digest/://}"

# Store the layers in manifest order: base layer first, top layer last.
mapfile -t layers < <(jq -r '.layers[].digest' "$manifest")

cached_layers=()

for layer in "${layers[@]}"; do
	layer_blob="$bundle/blobs/${layer/://}"
	cache_key="${layer/:/_}"
	cache_dir="$layer_cache/$cache_key"

	# If the layer isn't cached, cache it
	if [[ ! -d "$cache_dir" ]]; then
		rm -rf "$cache_dir"
		mkdir -p "$cache_dir"
		tar -xf "$layer_blob" -C "$cache_dir"
	fi

	cached_layers+=("$cache_dir")
done

# OverlayFS expects lowerdir with the top layer first, so reverse the OCI layer order.
lowerdir=""
for ((i=${#cached_layers[@]}-1; i>=0; i--)); do
	if [[ -z "$lowerdir" ]]; then
		lowerdir="${cached_layers[$i]}"
	else
		lowerdir="$lowerdir:${cached_layers[$i]}"
	fi
done

mount -t overlay overlay \
	-o "lowerdir=$lowerdir,upperdir=$upperdir,workdir=$workdir" \
	"$rootfs"

# Extract the entrypoint and command arrays from the config
mapfile -t entrypoint < <(jq -r '.config.Entrypoint[]?' "$config")
mapfile -t image_cmd < <(jq -r '.config.Cmd[]?' "$config")

if [[ $# -eq 0 ]]; then
	cmd=("${entrypoint[@]}" "${image_cmd[@]}")
elif [[ ${#entrypoint[@]} -gt 0 ]]; then
	cmd=("${entrypoint[@]}" "$@")
else
	cmd=("$@")
fi

if [[ ${#cmd[@]} -eq 0 ]]; then
	echo "No command configured for image, and no override was provided" >&2
	exit 1
fi

# Extract the working directory from the config, defaulting to "/" if not set
image_workdir="$(jq -r '.config.WorkingDir // "/"' "$config")"
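# Extract the environment variables from the config into an array of "KEY=VALUE" strings
mapfile -t image_env < <(jq -r '.config.Env[]?' "$config")
# Write them one per line with jq directly: `printf` would emit a single blank line for an empty array
jq -r '.config.Env[]?' "$config" > "$env_file"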

mapfile -t exposed_ports < <(jq -r '.config.ExposedPorts // {} | keys[] | select(endswith("/tcp")) | split("/")[0]' "$config")

netns="toydocker-$suffix"
host_veth="vethh$suffix"
container_veth="vetht$suffix"
container_ip="10.10.0.$((10 + (container_id % 200)))"
bridge="toy0"

# Set up the veth bridge
ip link show "$bridge" >/dev/null 2>&1 || {
	ip link add "$bridge" type bridge
	ip addr add 10.10.0.1/24 dev "$bridge"
	ip link set "$bridge" up
}
ip netns add "$netns"
ip link add "$host_veth" type veth peer name "$container_veth"
ip link set "$container_veth" netns "$netns"
ip link set "$host_veth" master "$bridge"
ip link set "$host_veth" up
ip netns exec "$netns" ip addr add "$container_ip/24" dev "$container_veth"
ip netns exec "$netns" ip link set "$container_veth" up
ip netns exec "$netns" ip link set lo up
ip netns exec "$netns" ip route add default via 10.10.0.1

for port in "${exposed_ports[@]}"; do
	host_port="1$port"
	socat "TCP4-LISTEN:$host_port,fork,bind=127.0.0.1,reuseaddr" "TCP4:$container_ip:$port" &
	echo "Forwarding Lima localhost:$host_port to container $container_ip:$port"
	echo "From macOS, run: ssh -F ~/.lima/ubuntu/ssh.config -N -L $port:127.0.0.1:$host_port lima-ubuntu"
done

mkdir -p "$cgroup"
echo $$ > "$cgroup/cgroup.procs"

ip netns exec "$netns" unshare --fork --pid --mount --uts --ipc --user --map-root-user bash -c '
	set -euo pipefail

	rootfs="$1"
	image_workdir="$2"
	env_file="$3"
	shift 3
	mapfile -t image_env < "$env_file"

	mount --make-rprivate /
	mkdir -p "$rootfs/proc"
	mount -t proc proc "$rootfs/proc"
	hostname toydocker

	cd "$rootfs"

	env -i \
		HOME=/root \
		PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin \
		"${image_env[@]}" \
		setpriv \
			--bounding-set=-sys_admin,-net_admin,-sys_module,-sys_time,-sys_boot \
			chroot "$rootfs" \
			/bin/sh -c "cd \"$image_workdir\" && exec \"\$@\"" sh "$@"
' bash "$rootfs" "$image_workdir" "$env_file" "${cmd[@]}"

[!NOTE] We ran...
mapfile -t exposed_ports < <(jq -r '.config.ExposedPorts // {} | keys[] | select(endswith("/tcp")) | split("/")[0]' "$config")
...to store the exposed ports from the config into the exposed_ports variable, such that we can make these ports accessible. However, this is a bit of a lie since the exposed ports in the config are not really a "publish rule" - many containers listen on ports they do not expose, and exposed ports do nothing by themselves in Docker unless you publish/map them.

This works, but it leaves a few things behind after the process exits:

The OverlayFS mount
The network namespace and its veth interfaces
The socat port-forwarding processes
The temporary cgroup
The temporary directory we created for this run

Let's add cleanup for those separately:

# ...

overlay_mounted=0
netns_created=0
cgroup_created=0
socat_pids=()

cleanup() {
	for pid in "${socat_pids[@]}"; do
		kill "$pid" 2>/dev/null || true
	done

	if [[ "$overlay_mounted" == "1" ]]; then
		umount "$rootfs" 2>/dev/null || true
	fi

	if [[ "$netns_created" == "1" ]]; then
		ip netns del "$netns" 2>/dev/null || true
	fi

	if [[ "$cgroup_created" == "1" ]]; then
		echo $$ > /sys/fs/cgroup/cgroup.procs 2>/dev/null || true
		rmdir "$cgroup" 2>/dev/null || true
	fi

	rm -rf "$container_dir"
}
trap cleanup EXIT

# ...

mount -t overlay overlay \
	-o "lowerdir=$lowerdir,upperdir=$upperdir,workdir=$workdir" \
	"$rootfs"
overlay_mounted=1

# ...

ip netns add "$netns"
netns_created=1

# ...

socat "TCP4-LISTEN:$host_port,fork,bind=127.0.0.1,reuseaddr" "TCP4:$container_ip:$port" &
socat_pids+=("$!")

# ...

mkdir -p "$cgroup"
cgroup_created=1
echo $$ > "$cgroup/cgroup.procs"

Key differences from our previous version:

The cached layers are immutable shared state. We extract them once and then keep reusing them across runs
Each container gets a fresh writable upperdir, so any files it creates or changes disappear when this run exits
The process runs inside its own PID, mount, UTS, IPC, network and user namespaces
The process is placed under a dedicated cgroup. This version does not set hard limits yet, but this is where we could add memory, CPU or pids limits exactly like we saw earlier
The image's exposed ports are forwarded from 127.0.0.1:1<container-port> inside Lima to the container IP with socat
The cleanup hook tears down per-run resources, while keeping the cached layers intact for the next run
After all the setup work that needs elevated permissions is done, we use setpriv to drop a few especially dangerous capabilities before entering the container root filesystem
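
The port and address arithmetic used by the runner is simple enough to check in isolation. Here is a minimal sketch of it; the host_port_for and container_ip_for function names are hypothetical, and the range check is my own addition rather than something the runner does:

```shell
# Host-side port for a container port, using the "prefix with 1" scheme.
# Fails when the result would exceed the valid TCP port range.
host_port_for() {
    candidate="1$1"
    [ "$candidate" -le 65535 ] || return 1
    echo "$candidate"
}

# Container address derived from a numeric container ID, as in the runner.
container_ip_for() {
    echo "10.10.0.$((10 + ($1 % 200)))"
}

host_port_for 6379
host_port_for 8080
container_ip_for 1234567
```

Two limits of this toy scheme become visible here: any container port of 10000 or above has no valid host port once the 1 is prefixed, and two container IDs with the same remainder modulo 200 would get the same IP address.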

This is still not a production-grade runtime.
We are intentionally skipping full Docker Desktop style networking, Seccomp in the final runner, tags, whiteout handling, full OCI runtime config generation and lifecycle management, but this is already much closer to the real shape of a container runtime.

Leveraging `runc`

At this point, we did a lot manually - mounted OverlayFS, created namespaces, mounted /proc, set the hostname, set up networking, entered the container root FS (chroot/pivot_root), dropped capabilities, added cleanups, etc.

This was a mouthful - the setup is fairly complex and a lot can go wrong!
This is useful for learning, but it is not how you would want to maintain a runtime.

This is exactly where an OCI runtime such as runc or crun comes in.

It is important to understand that runc is, at its core, a low-level container runtime.
It does not do the work that higher-level tools like dockerd/containerd do.

Namely:

runc does not pull images.
runc does not understand OCI image archives directly.
runc does not manage Docker-style networking, image names, tags, registries or build cache.
runc does create a process from an OCI runtime bundle.

An OCI runtime bundle is basically:

bundle/
  config.json
  rootfs/

The rootfs/ directory is the prepared filesystem the process should see as /.
The config.json file tells the runtime what to execute and which namespaces, mounts, capabilities, users, cgroups and other runtime settings to apply.
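To make that layout concrete, here is a minimal sketch of a sanity check for a bundle directory. The is_bundle helper is hypothetical, and it assumes the root filesystem lives in rootfs/ (the real location is whatever root.path in config.json points to).

```bash
# Hypothetical helper: check that a directory has the basic shape of an
# OCI runtime bundle (a config.json file next to a rootfs/ directory).
# Assumption: the root filesystem is in "rootfs/", which is only a convention.
is_bundle() {
	local bundle="$1"

	if [[ ! -f "$bundle/config.json" ]]; then
		echo "missing $bundle/config.json" >&2
		return 1
	fi

	if [[ ! -d "$bundle/rootfs" ]]; then
		echo "missing $bundle/rootfs/" >&2
		return 1
	fi

	return 0
}
```

For example, `is_bundle ./bundle && echo "looks like a bundle"` could run right before handing the directory to runc.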

So let's simplify our runner by keeping the image-preparation part, but delegating the low-level runtime part to runc.

We will focus on images like our basicimg, and avoid recreating the custom bridge/port-forwarding logic from earlier.

#!/usr/bin/env bash

set -euo pipefail

archive="${1:-}"
shift || true

if [[ -z "$archive" ]]; then
	echo "Usage: $0 <oci_archive>" >&2
	exit 1
fi

container_id="toydocker-runc-$(date +%s%N)"
workdir="$(mktemp -d /tmp/toydocker-runc.XXXXXXXXXX)"

image_dir="$workdir/image"
bundle="$workdir/bundle"
rootfs="$bundle/rootfs"

cleanup() {
	cd / 2>/dev/null || true
	runc delete -f "$container_id" 2>/dev/null || true
	rm -rf "$workdir"
}
trap cleanup EXIT

mkdir -p "$image_dir" "$rootfs"

# Unpack the OCI image archive
tar -xf "$archive" -C "$image_dir"

# Locate image manifest and config
manifest_digest="$(jq -r '.manifests[0].digest' "$image_dir/index.json")"
manifest="$image_dir/blobs/${manifest_digest/://}"

config_digest="$(jq -r '.config.digest' "$manifest")"
image_config="$image_dir/blobs/${config_digest/://}"

# Naively reconstruct the root filesystem from image layers.
# This is intentionally still simplified and does not handle OCI whiteouts.
mapfile -t layers < <(jq -r '.layers[].digest' "$manifest")
for layer in "${layers[@]}"; do
	tar -xf "$image_dir/blobs/${layer/://}" -C "$rootfs"
done

# Generate a default OCI runtime config.
cd "$bundle"
runc spec

args_json="$(jq -c '((.config.Entrypoint // []) + (.config.Cmd // []))' "$image_config")"
env_json="$(jq -c '(.config.Env // [])' "$image_config")"
cwd="$(jq -r '.config.WorkingDir // "/"' "$image_config")"

if [[ "$args_json" == "[]" ]]; then
	echo "No command configured for image" >&2
	exit 1
fi

# Patch the default runtime config with values from the image config.
jq \
	--argjson args "$args_json" \
	--argjson env "$env_json" \
	--arg cwd "$cwd" \
	'
		.root.path = "rootfs"
		| .root.readonly = false
		| .process.args = $args
		| .process.cwd = $cwd
		| .process.env = (if ($env | length) > 0 then $env else (.process.env // []) end)
		| .hostname = "toydocker-runc"
	' config.json > config.tmp
mv config.tmp config.json

runc run "$container_id"

Then we could wire it from the host in the same style as before:

limactl copy toydocker-runc-run.sh ubuntu:/tmp/toydocker-runc-run.sh
limactl shell --tty=true ubuntu -- sudo bash /tmp/toydocker-runc-run.sh /var/lib/toydocker/basicimg.tar

This version is much smaller because runc handles the runtime-specific part, which we previously did by hand.

runc creates the container process, applies namespaces, mounts, process args/env/CWD, capabilities and more.
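The same config.json is also where cgroup limits would go. As a sketch, assuming the linux.resources fields from the OCI runtime spec, a hypothetical add_limits helper could patch the generated config before runc run is called:

```bash
# Hypothetical helper: add resource limits to an existing OCI runtime config.
#   $1 = path to config.json
#   $2 = memory limit in bytes
#   $3 = maximum number of processes
# Assumption: the field names follow the OCI runtime spec (linux.resources).
add_limits() {
	local config="$1" memory_bytes="$2" pids_max="$3"

	jq \
		--argjson memory "$memory_bytes" \
		--argjson pids "$pids_max" \
		'
			.linux.resources.memory.limit = $memory
			| .linux.resources.pids.limit = $pids
		' "$config" > "$config.tmp"
	mv "$config.tmp" "$config"
}
```

In the script above, this could be called as `add_limits config.json $((128 * 1024 * 1024)) 64` right after the existing jq patch. runc would then be responsible for creating the cgroup and writing the limits.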

In our toydocker implementation, we went beyond the container runtime - we actually handled unpacking the image archive, finding the manifest/config and eventually translating all of this into the runtime config (which is the input for runc).

That boundary is the key idea.

In real life, containerd and its snapshotters take care of content storage, layer unpacking and snapshots.
The OCI runtime (runc/crun) is called only after the root filesystem and runtime config are ready.
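Layer unpacking is also where the whiteouts we skipped would be handled. As a rough sketch, assuming GNU tar and a trusted image (there is no path sanitization here), a hypothetical apply_layer helper could replace the plain tar -xf call inside the layer loop:

```bash
# Hypothetical helper: apply one image layer on top of an existing rootfs,
# honoring OCI whiteouts.
#   $1 = path to the layer tarball
#   $2 = rootfs directory built from the lower layers
# Assumptions: GNU tar, trusted layer content, no newlines in file names.
apply_layer() {
	local layer_tar="$1" rootfs="$2" entry dir base

	# Pass 1: apply the deletions recorded in this layer to whatever the
	# lower layers left behind.
	while IFS= read -r entry; do
		dir="$(dirname "$entry")"
		base="$(basename "$entry")"
		case "$base" in
			.wh..wh..opq)
				# Opaque whiteout: hide everything lower layers put in this directory.
				if [[ -d "$rootfs/$dir" ]]; then
					find "$rootfs/$dir" -mindepth 1 -delete
				fi
				;;
			.wh.*)
				# Regular whiteout: ".wh.<name>" deletes "<name>" in the same directory.
				rm -rf "${rootfs:?}/$dir/${base#.wh.}"
				;;
		esac
	done < <(tar -tf "$layer_tar")

	# Pass 2: extract the layer's real content, skipping the whiteout markers.
	tar --exclude='.wh.*' -xf "$layer_tar" -C "$rootfs"
}
```

Running the deletions before the extraction matters for opaque whiteouts: files that the same layer adds to the directory must survive, and only the content from lower layers should disappear.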

So the manual runner taught us what needs to happen, and the runc runner shows where the real abstraction boundary sits.

Conclusion

So, what did we do?

Understand how containers work, what container images actually are, and types of specs OCI provides
Poke around inside the OCI archive and understand the metadata images hold, and the layers they hold which turn into the filesystem root a container sees
Build a toy tool that supports some of the things that Docker CLI/dockerd/containerd do, like pulling/building images and extracting the runtime config
Run a "dumb" container, where we reconstructed the root FS in a very inefficient way and spawned a new process that runs the container's "entrypoint" with that directory as its root FS
Run a smarter container, which leveraged Linux kernel features to isolate the process (namespaces, cgroups, seccomp, capabilities)

Even though we went deep and technical, frankly there was a lot of oversimplification here for the sake of education. Moreover, I am not a container expert, so for anything deeper than this, I'm sure you will be able to find helpful resources.
As mentioned, the goal here was not to build a production container runtime, but rather to make the black box less intimidating.

I hope this was useful or interesting! To me, containers are an amazing and impressive technology.

Next up on the "from scratch" series - AI agents! Stay tuned on RSS/LinkedIn/X if interested.
