Down the Rabbit Hole: The MPI and Chroot Nightmare

The Simple Plan (That Wasn’t Simple)

After seeing the mysterious 26GB memory overhead in Docker, my plan was straightforward:

Extract the container’s filesystem
Run the same Python scripts natively
Compare the results
Done!

Ha. Hahahaha. No.

Attempt 1: Just Run It

My first thought: “Let’s just run the Docker scripts on my system. How hard can it be?”

python /home/khan/benchmark-spark/benchmarks/trtllm_benchmark.py \
    --model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B

What I got: Runtime nightmare.

Of course. The container has TensorRT-LLM, specific CUDA versions, MPI libraries, and about a million other dependencies that my host system doesn’t have (or has different versions of).

Okay, fine. Let’s extract the container and use chroot.

Extracting the Container

Docker images are basically layered tarballs, and docker export flattens a container’s filesystem into a single one. To extract:

# Export the container filesystem
docker create --name temp nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev
docker export temp > container.tar
docker rm temp

# Extract to a directory
mkdir -p /home/khan/container-rootfs
sudo tar -xf container.tar -C /home/khan/container-rootfs

Now I have the entire container filesystem at /home/khan/container-rootfs/. Cool!
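
Before chrooting, a quick sanity check that the extraction produced something shaped like a root filesystem can save a confusing error later. This is only a minimal sketch: the function name looks_like_rootfs and the list of entries it checks are my own choices for illustration, not something from the original scripts.

```shell
# Hypothetical helper: succeed only if the directory has the basic
# top-level entries a chroot (and the mounts used later) will need.
looks_like_rootfs() {
    rootfs="$1"
    for entry in bin etc usr proc sys dev; do
        # -e follows symlinks, so a relative link such as bin -> usr/bin passes too
        if [ ! -e "$rootfs/$entry" ]; then
            echo "missing: $rootfs/$entry" >&2
            return 1
        fi
    done
    return 0
}
```

Usage would look like: looks_like_rootfs /home/khan/container-rootfs && echo "rootfs looks complete"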

The MPI Library Hunt Begins

Time to run something:

sudo chroot /home/khan/container-rootfs python3 /workspace/benchmarks/trtllm_benchmark.py

New error:

What? Let me check which MPI it’s using:

which mpirun
# /usr/bin/mpirun  ← System MPI!

Ah. The chroot is finding the system’s MPI (installed at /usr/bin/mpirun) instead of the container’s MPI (at /opt/hpcx/ompi/bin/mpirun inside the rootfs).
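
The shell runs whichever mpirun appears first on PATH, so the search order alone decides which MPI you get. Here is a small self-contained sketch of that behaviour. It uses two throwaway stand-in scripts in temporary directories rather than real MPI installs, and the first_on_path helper and directory layout are made up for the demo.

```shell
# Hypothetical helper: print the first executable named $1 found on the PATH given in $2.
first_on_path() {
    # The subshell keeps the PATH change from leaking into the caller.
    ( PATH="$2"; command -v "$1" )
}

demo=$(mktemp -d)
mkdir -p "$demo/usr/bin" "$demo/opt/hpcx/ompi/bin"
for dir in "$demo/usr/bin" "$demo/opt/hpcx/ompi/bin"; do
    printf '#!/bin/sh\necho "mpirun from %s"\n' "$dir" > "$dir/mpirun"
    chmod +x "$dir/mpirun"
done

# "System" directory first: the stand-in under usr/bin wins.
first_on_path mpirun "$demo/usr/bin:$demo/opt/hpcx/ompi/bin"
# "HPC-X" directory first: the stand-in under opt/hpcx/ompi/bin wins.
first_on_path mpirun "$demo/opt/hpcx/ompi/bin:$demo/usr/bin"

rm -rf "$demo"
```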

The PATH Dance

TensorRT-LLM uses HPC-X OpenMPI from NVIDIA, which the container keeps at /opt/hpcx/ompi/. But chroot only swaps the root directory and inherits whatever environment it is launched from, so the PATH still points to system binaries first.

Solution: Explicitly set PATH to prioritize container binaries.

export PATH="/opt/hpcx/ompi/bin:$PATH"

Run again… new error:

The MPI binary is now correct, but it can’t find its libraries!

The LD_LIBRARY_PATH Saga

The container’s MPI needs libraries from /opt/hpcx/ompi/lib/, but LD_LIBRARY_PATH doesn’t include it.

export LD_LIBRARY_PATH="/opt/hpcx/ompi/lib:$LD_LIBRARY_PATH"

Run again… DIFFERENT error:

Wait, what? Now it’s finding the HPC-X libraries but they’re conflicting with system libraries!

The Real Problem

Here’s what was happening:

The system has OpenMPI installed (libmpi.so)
The container has HPC-X OpenMPI (/opt/hpcx/ompi/lib/libmpi.so)
TensorRT-LLM needs the HPC-X version
But the dynamic linker was mixing system and container libraries
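
One way to spot this kind of mix-up is to list every directory on LD_LIBRARY_PATH that contains a given library, in search order. What follows is a rough sketch: lib_candidates is a hypothetical helper, and it only inspects LD_LIBRARY_PATH, not the loader cache or the default library directories, so treat its output as a hint rather than the linker’s final answer.

```shell
# Hypothetical helper: print each LD_LIBRARY_PATH directory that contains
# a file named $1, in the order the entries appear in the variable.
lib_candidates() {
    lib="$1"
    old_ifs=$IFS
    IFS=:
    for dir in ${LD_LIBRARY_PATH:-}; do
        # Empty entries are skipped here (the real linker treats them
        # as the current working directory).
        if [ -n "$dir" ] && [ -e "$dir/$lib" ]; then
            echo "$dir/$lib"
        fi
    done
    IFS=$old_ifs
}
```

Running lib_candidates libmpi.so and seeing more than one line of output means several copies are reachable, and the first line is the one that takes priority among the LD_LIBRARY_PATH entries.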

The fix: Put container libraries at the FRONT of LD_LIBRARY_PATH:

export LD_LIBRARY_PATH="/opt/hpcx/ompi/lib:/usr/local/lib/python3.12/dist-packages/tensorrt_libs:$LD_LIBRARY_PATH"

Finally, MPI works!

The CUDA Version Surprise

Got MPI working. Started a benchmark. New error:

Wait, CUDA 13.0? We’re using CUDA 12.9!

Turns out: The system symlink /usr/local/cuda pointed to CUDA 13.0. The container needs CUDA 12.9.

Fix:

export CUDA_HOME="/usr/local/cuda-12.9"
export PATH="/usr/local/cuda-12.9/bin:$PATH"
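
To see where a toolkit symlink like /usr/local/cuda really points before trusting it, readlink -f resolves it to the real directory. The sketch below is self-contained and uses throwaway directories standing in for the two CUDA installs. The resolve_link helper name is mine.

```shell
# Hypothetical helper: print the fully resolved target of a path.
resolve_link() {
    readlink -f "$1"
}

demo=$(mktemp -d)
mkdir "$demo/cuda-12.9" "$demo/cuda-13.0"
ln -s "$demo/cuda-13.0" "$demo/cuda"   # mimics /usr/local/cuda pointing at CUDA 13.0

resolve_link "$demo/cuda"              # prints the path of the cuda-13.0 directory
rm -rf "$demo"
```

On the real system the equivalent check is simply readlink -f /usr/local/cuda.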

The Full Chroot Wrapper Script

After all this pain, I created a proper chroot wrapper that handles everything:

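#!/bin/bash
# Run a command inside the extracted container rootfs with a clean environment.
SCRIPT_DIR="/home/khan/container-rootfs"   # path to the extracted rootfs

# Mount necessary filesystems
sudo mount -t proc proc "${SCRIPT_DIR}/proc"
sudo mount --rbind /sys "${SCRIPT_DIR}/sys"
sudo mount --make-rslave "${SCRIPT_DIR}/sys"   # keep later unmounts from propagating to the host
sudo mount --rbind /dev "${SCRIPT_DIR}/dev"
sudo mount --make-rslave "${SCRIPT_DIR}/dev"
sudo mount --bind /home "${SCRIPT_DIR}/home"
sudo mkdir -p "${SCRIPT_DIR}/data"             # the image may not ship a /data directory
sudo mount --bind /data "${SCRIPT_DIR}/data"

# DNS resolution
sudo cp /etc/resolv.conf "${SCRIPT_DIR}/etc/resolv.conf"

# Run in chroot with proper environment (env -i drops all host variables)
sudo chroot "${SCRIPT_DIR}" /usr/bin/env -i \
    HOME=/root \
    PATH="/opt/hpcx/ompi/bin:/usr/local/cuda-12.9/bin:/usr/local/bin:/usr/bin:/bin" \
    LD_LIBRARY_PATH="/opt/hpcx/ompi/lib:/usr/local/lib/python3.12/dist-packages/tensorrt_libs:/usr/local/cuda-12.9/lib64" \
    CUDA_HOME="/usr/local/cuda-12.9" \
    PYTHONPATH="/usr/local/lib/python3.12/dist-packages" \
    "$@"
status=$?   # remember the command's exit code so cleanup doesn't mask it

# Cleanup: unmount everything in reverse order
# (-R because /sys and /dev were mounted with --rbind and carry submounts)
sudo umount "${SCRIPT_DIR}/data"
sudo umount "${SCRIPT_DIR}/home"
sudo umount -R "${SCRIPT_DIR}/dev"
sudo umount -R "${SCRIPT_DIR}/sys"
sudo umount "${SCRIPT_DIR}/proc"

exit "$status"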

Now I can run:

./run_in_rootfs.sh python3 /home/khan/benchmark-spark/benchmarks/trtllm_benchmark.py

And it actually works.
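
The env -i in the wrapper is what keeps host variables from leaking into the chroot: it starts the command with an empty environment plus only the variables listed. A tiny sketch of that behaviour outside any chroot follows. The variable names HOST_ONLY and KEEP and the clean_env_demo function are made up for the demo.

```shell
clean_env_demo() {
    # HOST_ONLY is placed in env's environment, the way a stray host setting would be.
    # env -i then drops everything inherited and passes only what is listed (KEEP here).
    HOST_ONLY="from-host" /usr/bin/env -i KEEP="from-wrapper" /bin/sh -c \
        'echo "HOST_ONLY=${HOST_ONLY:-<unset>} KEEP=${KEEP:-<unset>}"'
}

clean_env_demo   # prints: HOST_ONLY=<unset> KEEP=from-wrapper
```

This is also why the wrapper has to spell out PATH, LD_LIBRARY_PATH, and CUDA_HOME by hand: nothing from the host shell survives unless it is listed.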

Lessons Learned

Library paths matter: Mixing system and container libraries will bite you
Environment is everything: PATH, LD_LIBRARY_PATH, and CUDA_HOME are all critical
MPI is picky: HPC-X OpenMPI isn’t interchangeable with system OpenMPI
Filesystem mounts: The chroot needs /proc mounted, plus /sys, /dev, /home, and /data bind-mounted
DNS matters: I even forgot /etc/resolv.conf initially!
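
On the mounts lesson: unmounting has to happen in the reverse order of mounting, and it is easy to get out of sync when a script grows. One possible pattern is to record each mount point as it is mounted and replay the list newest-first. This is only a sketch. track_mount and unmount_order are hypothetical helpers, and it assumes mount paths contain no newlines.

```shell
# Newline-separated list of mount points, most recent first.
MOUNTED=""

# Hypothetical helper: remember a mount point right after mounting it.
track_mount() {
    if [ -z "$MOUNTED" ]; then
        MOUNTED="$1"
    else
        MOUNTED="$1
$MOUNTED"
    fi
}

# Hypothetical helper: print mount points in the order they should be unmounted.
unmount_order() {
    if [ -n "$MOUNTED" ]; then
        printf '%s\n' "$MOUNTED"
    fi
}

# In a real wrapper this could feed a cleanup trap, for example:
#   trap 'unmount_order | while read -r m; do sudo umount -R "$m"; done' EXIT
```

With a trap like the commented one, cleanup would also run when the benchmark command fails or the script is interrupted partway through.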

The Hard Part Is Over… Right?

With a working chroot environment, I could finally start benchmarking. But getting here took hours of debugging library paths and runtime errors.

Sometimes I wonder if this is why people just accept Docker’s overhead - at least it works out of the box!

But now that we have both Docker and native environments working, we can actually compare them fairly. And that’s where things get interesting…

In the next post: What Grace Blackwell’s unified memory architecture actually means, and why Docker’s cgroups don’t understand it.

Previous: ← Part 1: The Mystery
Next: Part 3: The Unified Memory Revelation →

GitHub Repo: benchmark-spark
