The Simple Plan (That Wasn’t Simple)
After seeing the mysterious 26GB memory overhead in Docker, my plan was straightforward:
- Extract the container’s filesystem
- Run the same Python scripts natively
- Compare the results
- Done!
Ha. Hahahaha. No.
Attempt 1: Just Run It
My first thought: “Let’s just run the Docker scripts on my system. How hard can it be?”
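The attempt looked something like this (the script name is a placeholder for the container's actual benchmark entrypoint):

```shell
# Run the container's benchmark script directly on the host Python
# ("benchmark.py" is a hypothetical name for the container's entrypoint)
python3 benchmark.py
```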
What I got: Runtime nightmare.
ModuleNotFoundError: No module named 'tensorrt_llm'
ImportError: libmpi.so.40: cannot open shared object file
CUDA Error: no CUDA-capable device is detected
Of course. The container has TensorRT-LLM, specific CUDA versions, MPI libraries, and about a million other dependencies that my host system doesn’t have (or has different versions of).
Okay, fine. Let’s extract the container and use chroot.
Extracting the Container
Docker containers are just fancy tarballs with layers. To extract:
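The extraction, sketched with a placeholder image name (substitute whatever TensorRT-LLM image you're actually running):

```shell
# IMAGE is a placeholder; use your actual TensorRT-LLM image tag
IMAGE=your-tensorrt-llm-image:tag

# Create a stopped container from the image, then dump its filesystem
docker create --name extract "$IMAGE"
mkdir -p /home/khan/container-rootfs
docker export extract | tar -C /home/khan/container-rootfs -xf -
docker rm extract
```

`docker export` flattens all the image layers into a single tarball of the container's root filesystem, which is exactly what we want for chroot.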
Now I have the entire container filesystem at /home/khan/container-rootfs/. Cool!
The MPI Library Hunt Begins
Time to run something:
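Something along these lines (the benchmark path is hypothetical; yours will differ):

```shell
# First naive attempt: chroot with no env setup and no extra mounts
sudo chroot /home/khan/container-rootfs \
    python3 /app/benchmark.py
```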
New error:
[TRT-LLM] [I] MPI size: 1, rank: 0
MPI Error: undefined symbol: ucs_config_doc_nop
What? Let me check which MPI it’s using:
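```shell
# Which mpirun does the chroot actually resolve?
sudo chroot /home/khan/container-rootfs which mpirun
# → /usr/bin/mpirun  (not the HPC-X copy at /opt/hpcx/ompi/bin/mpirun)
```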
Ah. The chroot is finding the system’s MPI (installed at /usr/bin/mpirun) instead of the container’s MPI (at /opt/hpcx/ompi/bin/mpirun inside the rootfs).
The PATH Dance
TensorRT-LLM uses HPC-X OpenMPI from NVIDIA. The container has it at /opt/hpcx/ompi/. But when I chroot, the PATH still points to system binaries first.
Solution: Explicitly set PATH to prioritize container binaries.
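```shell
# Prepend HPC-X's bin directory so the chroot finds its mpirun first
export PATH=/opt/hpcx/ompi/bin:$PATH
```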
Run again… new error:
mpirun: error while loading shared libraries: libucs.so.0: cannot open shared object file
The MPI binary is now correct, but it can’t find its libraries!
The LD_LIBRARY_PATH Saga
The container’s MPI needs libraries from /opt/hpcx/ompi/lib/, but LD_LIBRARY_PATH doesn’t include it.
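My first attempt simply appended the HPC-X library directory:

```shell
# First attempt: append the HPC-X lib dir (spoiler: appending isn't enough)
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/hpcx/ompi/lib
```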
Run again… DIFFERENT error:
symbol lookup error: /opt/hpcx/ompi/lib/libucc.so.1: undefined symbol: ucp_worker_progress
Wait, what? Now it’s finding the HPC-X libraries but they’re conflicting with system libraries!
The Real Problem
Here’s what was happening:
- The system has OpenMPI installed (libmpi.so)
- The container has HPC-X OpenMPI (/opt/hpcx/ompi/lib/libmpi.so)
- TensorRT-LLM needs the HPC-X version
- But the dynamic linker was mixing system and container libraries
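If you want to see the mixing for yourself, ldd is the tool: it prints which copy of each shared library the dynamic linker resolves for a binary, and a listing that mixes /opt/hpcx paths with system /usr/lib paths is the smoking gun. (The binary path here assumes the standard HPC-X layout.)

```shell
# Show which MPI/UCX/UCS/UCC libraries the linker actually picks
sudo chroot /home/khan/container-rootfs \
    ldd /opt/hpcx/ompi/bin/mpirun | grep -E 'mpi|ucx|ucs|ucc'
```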
The fix: Put container libraries at the FRONT of LD_LIBRARY_PATH:
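A sketch of the fix; the ucx and ucc paths are assumptions based on the standard HPC-X directory layout:

```shell
# Put ALL of HPC-X's library dirs first so its MPI/UCX/UCC stack resolves
# against itself, never against the system copies
export LD_LIBRARY_PATH=/opt/hpcx/ompi/lib:/opt/hpcx/ucx/lib:/opt/hpcx/ucc/lib:$LD_LIBRARY_PATH
```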
Finally, MPI works!
The CUDA Version Surprise
Got MPI working. Started a benchmark. New error:
RuntimeError: Triton only supports CUDA 10.0 or higher, but got CUDA version: 13.0
Wait, CUDA 13.0? We’re using CUDA 12.9!
Turns out: The system symlink /usr/local/cuda pointed to CUDA 13.0. The container needs CUDA 12.9.
Fix:
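My fix was to point everything at 12.9 explicitly instead of trusting the `/usr/local/cuda` symlink (a sketch assuming the usual /usr/local/cuda-XX.Y install layout):

```shell
# Pin the CUDA toolkit version explicitly; don't rely on the symlink
export CUDA_HOME=/usr/local/cuda-12.9
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
```

Re-pointing the symlink itself (`ln -sfn /usr/local/cuda-12.9 /usr/local/cuda`) would also work, but changes it for everything else on the machine.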
The Full Chroot Wrapper Script
After all this pain, I created a proper chroot wrapper that handles everything:
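Here's a sketch of that wrapper. The rootfs path and bind-mount list match my setup, and the env values are the ones we fought for above; treat it as a starting point, not gospel:

```shell
# Write a chroot wrapper that mounts the pseudo-filesystems, binds the data
# dirs, copies DNS config, sets the env, and execs the given command.
cat > run-in-rootfs.sh <<'EOF'
#!/bin/bash
set -e
ROOTFS=/home/khan/container-rootfs

# Pseudo-filesystems the runtime expects
mount -t proc proc "$ROOTFS/proc" 2>/dev/null || true
mount -t sysfs sys "$ROOTFS/sys"  2>/dev/null || true
mount --rbind /dev "$ROOTFS/dev"  2>/dev/null || true

# Bind-mount the host dirs the benchmarks read and write
mount --bind /home "$ROOTFS/home" 2>/dev/null || true
mount --bind /data "$ROOTFS/data" 2>/dev/null || true

# DNS: the detail I forgot the first time
cp /etc/resolv.conf "$ROOTFS/etc/resolv.conf"

# Environment: HPC-X binaries and libraries FIRST, correct CUDA pinned
exec chroot "$ROOTFS" /usr/bin/env \
    PATH=/opt/hpcx/ompi/bin:/usr/local/cuda-12.9/bin:/usr/local/bin:/usr/bin:/bin \
    LD_LIBRARY_PATH=/opt/hpcx/ompi/lib:/opt/hpcx/ucx/lib:/opt/hpcx/ucc/lib:/usr/local/cuda-12.9/lib64 \
    CUDA_HOME=/usr/local/cuda-12.9 \
    "$@"
EOF
chmod +x run-in-rootfs.sh
```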
Now I can run:
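Something like this, with hypothetical wrapper and script names:

```shell
# Run the benchmark natively through the chroot wrapper
sudo ./run-in-rootfs.sh python3 /app/benchmark.py
```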
And it actually works.
Lessons Learned
- Library paths matter: System vs container libraries will bite you
- Environment is everything: PATH, LD_LIBRARY_PATH, CUDA_HOME all critical
- MPI is picky: HPC-X OpenMPI isn’t interchangeable with system OpenMPI
- Filesystem mounts: Need /proc, /sys, /dev, /home, and /data bind mounts
- DNS matters: I even forgot to copy /etc/resolv.conf initially!
The Hard Part Is Over… Right?
With a working chroot environment, I could finally start benchmarking. But getting here took hours of debugging library paths and runtime errors.
Sometimes I wonder if this is why people just accept Docker’s overhead - at least it works out of the box!
But now that we have both Docker and native environments working, we can actually compare them fairly. And that’s where things get interesting…
In the next post: What Grace Blackwell’s unified memory architecture actually means, and why Docker’s cgroups don’t understand it.
Previous: ← Part 1: The Mystery
Next: Part 3: The Unified Memory Revelation →
GitHub Repo: benchmark-spark