Slurm pyxis using a Docker image from a private registry
If I sent you a link to this you probably want the TL;DR at the bottom
Context
Previously: 220712-2208 Slurm creating modifiable persistent container
Problem: I have a Docker image in a private registry that needs a user/pass, and I need to use it in Slurm's pyxis.
The default `srun --container-image ...` syntax has no obvious place for a Docker registry user/pass.
Trying to use an image from a private registry does this:
```
$ srun --mem=16384 -c2 --gres=gpu:v100:2 --container-image comp/myimage:latest
slurmstepd: error: pyxis: child 2505947 failed with error code: 1
slurmstepd: error: pyxis: failed to import docker image
slurmstepd: error: pyxis: printing contents of log file ...
slurmstepd: error: pyxis: [INFO] Querying registry for permission grant
slurmstepd: error: pyxis: [INFO] Authenticating with user: <anonymous>
slurmstepd: error: pyxis: [INFO] Authentication succeeded
slurmstepd: error: pyxis: [INFO] Fetching image manifest list
slurmstepd: error: pyxis: [INFO] Fetching image manifest
slurmstepd: error: pyxis: [ERROR] URL https://registry-1.docker.io/[...] returned error code: 401 Unauthorized
```
Slurm's pyxis uses enroot to do the container magic, which includes interfacing with Docker. `enroot` is installed on the box, Docker isn't, and I have no root access.
Option/attempt 1: Using enroot config to pass a credentials file
I need to pass configuration through `srun` to enroot, so it can access the Docker registry.
To pass credentials, create a credentials file at `$ENROOT_CONFIG_PATH/.credentials`:

```
# DockerHub
machine auth.docker.io login <login> password <password>
```

That env var is not set in the base system; I set it to `/home/me/enroot/` and put the file there - same (no) result.
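Put together, the whole attempt looked roughly like the script below. This is the documented mechanism for plain enroot, but it did NOT end up working through pyxis for me; the `$HOME/enroot` path mirrors my setup, and the login/password are placeholders:

```shell
# Attempt 1 as a script: create an enroot .credentials file
# (netrc format) and point ENROOT_CONFIG_PATH at its directory.
export ENROOT_CONFIG_PATH="${ENROOT_CONFIG_PATH:-$HOME/enroot}"
mkdir -p "$ENROOT_CONFIG_PATH"
cat > "$ENROOT_CONFIG_PATH/.credentials" <<'EOF'
# DockerHub
machine auth.docker.io login <login> password <password>
EOF
# netrc-style credential files should not be world-readable
chmod 600 "$ENROOT_CONFIG_PATH/.credentials"
```

With plain `enroot import` this file is picked up automatically; the problem described below is getting pyxis to see the env var at all.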
After googling, I found this really detailed thread about the way pyxis handles environment variables:
enroot/import.md at master · NVIDIA/enroot
Especially this specific comment:
pyxis doesn’t use environment variables defined in enroot .env files · Issue #46 · NVIDIA/pyxis
So basically, `enroot` and `pyxis` behave in opposite ways:
- if a 'dynamic' env var is defined in `enroot` conf files, `enroot` passes it to the container, but `pyxis` doesn't;
- if it's not defined in `enroot` conf files, `enroot` doesn't pass it to the container, but `pyxis` does.
I don't have write access to the enroot config files, but since `$ENROOT_CONFIG_PATH` isn't set there, I should be able to change it. No effect though.
Giving up for now, though that would've been the most beautiful solution.
Attempt 2: Get the image separately through enroot
I could use pure `enroot` to get the Docker image, then pass the resulting file to `srun`.
Run “Docker” Containers with NVIDIA Enroot:
To use OAuth authentication and a token, you would need to sign up/sign in and create a token (which you can save for reuse) and then do the container import as:

```
enroot import 'docker://$oauthtoken@nvcr.io#nvidia/tensorflow:21.04-tf1-py3'
```
Awesome, let's create a token and try:
… okay, what's the address of the Docker Hub? The `hub.docker.com` one that's the default and ergo never written out anywhere, but that I now need to pass explicitly?..
Anyway, let's try to get bitnami/minideb from a public repo to pin the syntax down. `hub.docker.com` returned 404s; trial and error led me to `docker.io`:
```
[INFO] Querying registry for permission grant
[INFO] Permission granted
[INFO] Fetching image manifest list
[ERROR] Could not process JSON input
curl: (23) Failed writing body (1011 != 4220)
```
`registry-1.docker.io` actually asked me for a password!
```
enroot import 'docker://$token@registry-1.docker.io#bitnami/minideb:latest'
[INFO] Querying registry for permission grant
[INFO] Authenticating with user: $token
Enter host password for user '$token':
[ERROR] URL https://auth.docker.io/token returned error code: 401 Unauthorized
```
Without providing the token, the image gets downloaded! Then I found `index.docker.io`, which seems to be the correct one.
Okay, let's get my private one:
```
me@slurm-box:/slurm/me$ ENROOT_CONFIG_PATH=/home/me/enroot enroot import 'docker://index.docker.io#comp/myimage:latest'
```
401 error unauthorized, still ignoring my `.credentials` file or the env variable pointing to it.
Docker username only:

```
enroot import 'docker://mydockerusername@index.docker.io#comp/myimage:latest'
```

Asks me for a password and then imports correctly! And creates a file called `myimage.sqsh` in the current dir.
Woohoo, a working way to get Docker images from a private registry!
```
$ enroot start myimage.sqsh
enroot-nsenter: failed to create user namespace: Operation not permitted
```

Okay, so I'm not allowed to start them with `enroot` - not that I had any reason to.
```
srun --mem=16384 -c4 --gres=gpu:v100:2 --container-image ./Docker/myimage.sqsh --container-mounts=/slurm/$(id -u -n)/data:/data --container-workdir /data --pty bash
```

Drops me inside a shell in the container - it works!
Next step: using the Docker token.
Docker seems to treat it as a password replacement, which conflicts with the official docs:

```
# Import Tensorflow 19.01 from NVIDIA GPU Cloud
$ enroot import --output tensorflow.sqsh 'docker://$oauthtoken@nvcr.io#nvidia/tensorflow:19.01-py3'
```

On further googling: the `$oauthtoken` username is specific to `nvcr.io`; Docker Hub does its own thing, and there I use the token simply as a password replacement, period. Okay.
Had issues mounting stuff as `/data`, because that specific path is also used inside the Docker image - so I mounted to something else.
The Dockerfile also has an `ENTRYPOINT`, and `srun` wants something to execute; `true` can be passed. Couldn't get this to work: with no command, `srun` refuses to start, and passing `true` makes it ignore the entrypoint altogether. `--[no-]container-entrypoint` from the docs didn't help - leaving this for later.
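One workaround I haven't tried yet: since pyxis skips the `ENTRYPOINT` whenever you pass a command, you could invoke the entrypoint script yourself *as* the command. Here's the mechanics simulated locally with a stand-in script - the `/tmp/entrypoint.sh` path and its contents are hypothetical, check your actual Dockerfile for the real one:

```shell
# A typical ENTRYPOINT script does some setup, then exec's its arguments.
# Stand-in for the real script baked into the image:
cat > /tmp/entrypoint.sh <<'EOF'
#!/bin/sh
export APP_READY=1   # pretend setup work
exec "$@"            # hand off to whatever command was passed
EOF
chmod +x /tmp/entrypoint.sh

# Calling it explicitly reproduces the ENTRYPOINT-then-command semantics;
# the same pattern could be given to srun instead of the bare command:
/tmp/entrypoint.sh sh -c 'echo "APP_READY=$APP_READY"'   # prints APP_READY=1
```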
Final line:

```
srun --mem=16384 -c4 --gres=gpu:v100:2 --container-image ./Docker/myimage.sqsh --container-mounts=/slurm/$(id -u -n)/data:/SLURMdata --container-writable python3 -m trainer_module -i /data/ -o /SLURMdata/Checkpoints/ --config-file /SLURMdata/config.yaml
```
This:
- makes the image writable, so huggingface and friends can download stuff;
- makes `/slurm/me/data` available as `/SLURMdata` inside the image;
- passes a config file that I have inside `/data/config.yaml` to the trainer (which accesses it as `/SLURMdata/config.yaml`);
- runs the training on a dataset inside the directory that the Dockerfile itself puts at `/data` in the image (the one that conflicted with mine earlier);
- puts training results in a directory inside `/SLURMdata`, which means they're available in my `/slurm/me/data` directory after `srun` is done.
TODO / for later
- Try again to find a way to use a `.credentials` file - one command less to run then
- How to run my Docker image's `ENTRYPOINT`
(More) resources
- Enroot `import` tutorial/help: enroot/import.md at master · NVIDIA/enroot
- Really detailed about pure-enroot Docker bits, superset of the prev. link: Run “Docker” Containers with NVIDIA Enroot
- pyxis has more detailed usage guides! Especially this one: Usage · NVIDIA/pyxis Wiki
- Extremely detailed/useful discussion about pyxis/enroot environment variables: pyxis doesn’t use environment variables defined in enroot .env files · Issue #46 · NVIDIA/pyxis
- Slurm with Docker and examples of doing Jupyter Notebook: Containers - NHR@KIT User Documentation
- Docker best practices, esp. `ENTRYPOINT`: Best practices for writing Dockerfiles | Docker Documentation
TL;DR
I found two ways: passing credentials for the Docker registry didn't work, but separately downloading the image and then running it did. Read the entire post if you want details on most of this.
Getting the image:

```
enroot import 'docker://mydockerusername@index.docker.io#comp/myimage:latest'
```

Replace `mydockerusername` with your Docker username, `comp` with the company name and `myimage` with the name of the image.
It will ask you for your Docker password or Personal Access Token, and will download the image into a `*.sqsh` file in the current directory (or wherever you point the `-o` parameter).
Running the image:

```
srun --mem=16384 -c4 --gres=gpu:v100:2 --container-image ./Docker/myimage.sqsh --container-mounts=/slurm/$(id -u -n)/data:/SLURMdata --container-writable your_command_to_run
# or - if you are running the thing I'm running - ...
srun --mem=16384 -c4 --gres=gpu:v100:2 --container-image ./Docker/myimage.sqsh --container-mounts=/slurm/$(id -u -n)/data:/SLURMdata --container-writable python3 -m trainer_module -i /data/ -o /SLURMdata/Checkpoints/ --config-file /SLURMdata/config.yaml
```
In decreasing order of interest/generality:
- Pass the downloaded `*.sqsh` file to `--container-image`.
- Environment variables get passed as-is in most cases. If you'd do `docker run --env ENV_VAR_NAME`, here you'd say `ENV_VAR_NAME=whatever srun ...` or just `export ...` it before running, and it should work.
- `--container-writable` is needed to make the filesystem writable; huggingface needs that to write cache files.
- `--container-mounts` are `/dir_in_your_fs:/dir_inside_docker_image`. Make sure the Docker image itself doesn't have anything unexpected located at `/dir_inside_docker_image`.
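The env-var pass-through from the second bullet can be sanity-checked without a cluster, since `srun` behaves like any other child process with respect to inherited variables. A quick local check of the pattern, with `bash` standing in for `srun` (the commented `srun` line is illustrative, not verified against every pyxis version):

```shell
# The pattern: prefix the command with VAR=value, and the child sees it.
HF_TOKEN=hello bash -c 'echo "inside: $HF_TOKEN"'   # prints: inside: hello

# The srun equivalent would look like (illustrative paths):
#   HF_TOKEN=... srun --container-image ./myimage.sqsh python3 train.py
```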