In the middle of the desert you can say anything you want
If I sent you a link to this you probably want the TL;DR at the bottom
Previously: 220712-2208 Slurm creating modifiable persistent container
Problem: I have a Docker image in a private Docker registry that needs a user/pass, and I need to use it in slurm's pyxis.
The default `srun --container-image ..` syntax has no obvious place for a Docker registry user/pass.
Trying to use an image from a private registry does this:
$ srun --mem=16384 -c2 --gres=gpu:v100:2 --container-image comp/myimage:latest
slurmstepd: error: pyxis: child 2505947 failed with error code: 1
slurmstepd: error: pyxis: failed to import docker image
slurmstepd: error: pyxis: printing contents of log file ...
slurmstepd: error: pyxis: [INFO] Querying registry for permission grant
slurmstepd: error: pyxis: [INFO] Authenticating with user: <anonymous>
slurmstepd: error: pyxis: [INFO] Authentication succeeded
slurmstepd: error: pyxis: [INFO] Fetching image manifest list
slurmstepd: error: pyxis: [INFO] Fetching image manifest
slurmstepd: error: pyxis: [ERROR] URL https://registry-1.docker.io/[...] returned error code: 401 Unauthorized
Slurm's pyxis uses enroot to do the container magic, including interfacing with Docker. enroot is installed on the box, Docker isn't, and I have no root access. I need to pass srun configs through to enroot so that it can access the Docker registry.
To pass credentials, create a credentials file at `$ENROOT_CONFIG_PATH/.credentials`:
# DockerHub
machine auth.docker.io login <login> password <password>
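Side note (my addition, not from the enroot docs): this is netrc syntax, and Python's stdlib `netrc` module parses the same format, which makes for a quick sanity check that the file is well-formed (placeholder credentials below):

```python
import netrc
import tempfile
from pathlib import Path

# Write a sample .credentials file (netrc syntax, placeholder values).
config_dir = Path(tempfile.mkdtemp())
creds = config_dir / ".credentials"
creds.write_text("machine auth.docker.io login mylogin password mypassword\n")

# Parse it back; authenticators() returns (login, account, password).
login, account, password = netrc.netrc(str(creds)).authenticators("auth.docker.io")
print(login, password)  # mylogin mypassword
```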
That env var is not set in the base system; set it to `/home/me/enroot/` and put the file there - same (no) result.
After googling, found this really detailed page about the way enroot handles environment variables: enroot/import.md at master · NVIDIA/enroot
Especially this specific issue: pyxis doesn't use environment variables defined in enroot .env files · Issue #46 · NVIDIA/pyxis
So basically, enroot and pyxis behave in opposite ways:
- if a 'dynamic' env var is defined in the enroot conf files, enroot passes it to the container, but pyxis doesn't;
- if it's not defined in the enroot conf files, enroot doesn't pass it to the container, but pyxis does.
I don't have write access to the enroot config files, but since `$ENROOT_CONFIG_PATH` isn't set there, I should be able to change it myself. No effect though. Giving up for now, though that would've been the most beautiful solution.
enroot
I could use pure enroot to get the docker image, then pass the resulting file to `srun`.
Run “Docker” Containers with NVIDIA Enroot
To use OAuth authentication and a token, you would need to sign up/sign in and create a token (which you can save for reuse) and then do the container import as:
enroot import 'docker://$oauthtoken@nvcr.io#nvidia/tensorflow:21.04-tf1-py3'
Awesome, let's create a token and try.
… okay, what's the address of the Docker Hub? The `hub.docker.com` one that's the default and ergo not used anywhere, but that I need to pass explicitly?..
Anyway, let's try to get bitnami/minideb from a public repo first to pin the syntax down. `hub.docker.com` returned 404s; trial and error led me to `docker.io`:
[INFO] Querying registry for permission grant
[INFO] Permission granted
[INFO] Fetching image manifest list
[ERROR] Could not process JSON input
curl: (23) Failed writing body (1011 != 4220)
`registry-1.docker.io` actually asked me for a password!
enroot import 'docker://$token@registry-1.docker.io#bitnami/minideb:latest'
[INFO] Querying registry for permission grant
[INFO] Authenticating with user: $token
Enter host password for user '$token':
[ERROR] URL https://auth.docker.io/token returned error code: 401 Unauthorized
Without providing the token the image gets downloaded! Then I found `index.docker.io`, which seems to be the correct one.
Okay, let's get my private one:
me@slurm-box:/slurm/me$ ENROOT_CONFIG_PATH=/home/me/enroot enroot import 'docker://index.docker.io#comp/myimage:latest'
401 error unauthorized, still ignoring my `.credentials` or the env variable pointing to it.
Docker username only:
enroot import 'docker://mydockerusername@index.docker.io#comp/myimage:latest'
Asks me for a password and then imports correctly! And creates a file called `myimage.sqsh` in the current dir.
Woohoo, working way to get docker images from private registry!
$ enroot start myimage.sqsh
enroot-nsenter: failed to create user namespace: Operation not permitted
Okay, so I'm not allowed to start containers with enroot - not that I had any reason to.
srun --mem=16384 -c4 --gres=gpu:v100:2 --container-image ./Docker/myimage.sqsh --container-mounts=/slurm/$(id -u -n)/data:/data --container-workdir /data --pty bash
Drops me inside a shell in the container - it works!
Next step - using the Docker token. Docker seems to treat it as a password replacement, which conflicts with the official docs:
# Import Tensorflow 19.01 from NVIDIA GPU Cloud
$ enroot import --output tensorflow.sqsh 'docker://$oauthtoken@nvcr.io#nvidia/tensorflow:19.01-py3'
On further googling - the `$oauthtoken` username is specific to `nvcr.io`; Docker Hub uses standard Docker auth, and there I use the token as a password replacement, period. Okay.
Had issues with mounting stuff as `/data` by default, because that specific path is used in the docker image too - used something else.
The Dockerfile also has an `ENTRYPOINT`, and `srun` wants something to execute, so `true` can be passed. Couldn't get this to work: without `true`, `srun` refuses to start, and passing `true` makes it ignore the entrypoint altogether. `--[no-]container-entrypoint` from the docs didn't help - leaving for later.
Final line:
srun --mem=16384 -c4 --gres=gpu:v100:2 --container-image ./Docker/myimage.sqsh --container-mounts=/slurm/$(id -u -n)/data:/SLURMdata --container-writable python3 -m trainer_module -i /data/ -o /SLURMdata/Checkpoints/ --config-file /SLURMdata/config.yaml
This:
- makes `/slurm/me/data` available as `/SLURMdata` inside the image;
- passes `/data/config.yaml` to the trainer (which accesses it as `/SLURMdata/config.yaml`);
- uses `/data` in the image itself (the one that conflicted with mine earlier);
- outputs to `/SLURMdata`, which means it's available to me after `srun` is done, in my `/slurm/me/data` directory.

Left for later:
- the `.credentials` file - one command less to run then;
- `ENTRYPOINT`.

Links:
- enroot `import` tutorial/help: enroot/import.md at master · NVIDIA/enroot
- pyxis has more detailed usage guides! Especially this one: Usage · NVIDIA/pyxis Wiki
- `ENTRYPOINT`: Best practices for writing Dockerfiles | Docker Documentation

TL;DR

Two ways I found: passing credentials for the docker registry didn't work, separately downloading the image and then running it did. Read the entire post if you want details on most of this.
enroot import 'docker://mydockerusername@index.docker.io#comp/myimage:latest'
Replace `mydockerusername` with your docker username, `comp` with the company name, and `myimage` with the name of the image.
It will ask you for your Docker pass or Personal Access Token.
It will download the image into a `*.sqsh` file in the current directory, or wherever you point the `-o` parameter.
srun --mem=16384 -c4 --gres=gpu:v100:2 --container-image ./Docker/myimage.sqsh --container-mounts=/slurm/$(id -u -n)/data:/SLURMdata --container-writable your_command_to_run
# or - if you are running the thing I'm running - ...
srun --mem=16384 -c4 --gres=gpu:v100:2 --container-image ./Docker/myimage.sqsh --container-mounts=/slurm/$(id -u -n)/data:/SLURMdata --container-writable python3 -m trainer_module -i /data/ -o /SLURMdata/Checkpoints/ --config-file /SLURMdata/config.yaml
In decreasing order of interest/generality:
- You pass the downloaded `*.sqsh` file to `--container-image`.
- To pass env variables, instead of docker's `docker run --env ENV_VAR_NAME`, here you'd say `ENV_VAR_NAME=whatever srun ...` or just `export ...` it before running, and it should work.
- `--container-writable` is needed to make the filesystem writable; huggingface needs that to write cache files.
- `--container-mounts` uses the syntax `/dir_in_your_fs:/dir_inside_docker_image`, making the former available as `/dir_inside_docker_image` inside the container.
Really nice, and the blog post introducing it has a lot of general info about datasets that I found very interesting.
From python - How do I type hint a method with the type of the enclosing class? - Stack Overflow:
If you have a classmethod and want to annotate the return value as that same class you’re now defining, you can actually do the logical thing!
from __future__ import annotations

class Whatever:
    # ...
    @classmethod
    def what(cls) -> Whatever:
        return cls()
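As a side note (my own sketch, not from the SO answer): returning `cls()` instead of a hard-coded `Whatever()` means subclasses get instances of themselves for free:

```python
from __future__ import annotations


class Whatever:
    @classmethod
    def what(cls) -> Whatever:
        # cls is whichever class the method was called on.
        return cls()


class Child(Whatever):
    pass


print(type(Whatever.what()).__name__)  # Whatever
print(type(Child.what()).__name__)     # Child
```

(Strictly, the subclass return type is better expressed with `typing.Self` on Python 3.11+, but that's beyond the quoted answer.)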
It started with writing type hints for a complex dict, which led me to TypedDict, and slowly went into "why can't I just do a dataclass as with the rest". Found two libraries that let you create a dataclass instance from a dict, `MyClass(**somedict)`-style. One of them, Dataclass Wizard, converts `field_names` to camelcase `fieldNames` by default; one can disable that from settings: Extending from Meta — Dataclass Wizard 0.22.1 documentation

TIL another bit I won't ever use: 21. for/else — Python Tips 0.1 documentation
This exists:

for a in whatever:
    a.whatever()
else:
    print("Loop finished without a break")

The `else` branch runs whenever the loop completes without hitting a `break` - so not only when `whatever` is empty. Found it after having a wrong indentation of an `else` that put it inside the `for` loop.
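The canonical use of `for/else` is a search loop, where the `else` handles the "not found" case; a minimal sketch:

```python
def find_first_even(numbers):
    """Return the first even number in `numbers`, or None."""
    for n in numbers:
        if n % 2 == 0:
            found = n
            break
    else:
        # Runs only when the loop finished without hitting `break` -
        # including, but not limited to, the empty-input case.
        found = None
    return found


print(find_first_even([1, 3, 4]))  # 4
print(find_first_even([1, 3, 5]))  # None
```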
Found at least three:
Spent hours tracking down a bug that boiled down to:
A if args.sth.lower == "a" else B

Guess what - `args.sth.lower` is a callable and will never be equal to a string, so `args.sth.lower == "a"` is always `False`. Of course I needed `args.sth.lower()`.
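The gotcha in isolation (a plain string as a stand-in for `args.sth`):

```python
sth = "A"  # stand-in for args.sth

# Missing parentheses: this compares a bound method to a string.
print(sth.lower == "a")     # False, always
print(callable(sth.lower))  # True - it's a method, not a string

# What was actually meant:
print(sth.lower() == "a")   # True
```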
Python sets have two kinds of methods:
- `a.intersection(b)`, which returns the intersection;
- `a.intersection_update(b)`, which updates `a` by removing elements not found in `b`.

The documentation calls the function-like ones (that return the result) operators, as opposed to the 'update' ones.
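A quick sketch of the difference (my own example):

```python
a = {1, 2, 3}
b = {2, 3, 4}

# Returns a new set; `a` is unchanged.
c = a.intersection(b)
print(c)  # {2, 3}
print(a)  # {1, 2, 3}

# Mutates `a` in place and returns None, like list.sort().
result = a.intersection_update(b)
print(result)  # None
print(a)       # {2, 3}

# Operator form of the non-mutating version:
print({1, 2, 3} & b)  # {2, 3}
```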
Given an argument -l
, I needed to pass multiple values to it.
python - How can I pass a list as a command-line argument with argparse? - Stack Overflow is an extremely detailed answer with all options, but the TL;DR is:
Use `nargs`:

parser.add_argument('-l','--list', nargs='+', help='<Required> Set flag', required=True)
# Use like:
# python arg.py -l 1234 2345 3456 4567

Or `action='append'`:

parser.add_argument('-l','--list', action='append', help='<Required> Set flag', required=True)
# Use like:
# python arg.py -l 1234 -l 2345 -l 3456 -l 4567
Details about values for nargs
:
# This is the correct way to handle accepting multiple arguments.
# '+' == 1 or more.
# '*' == 0 or more.
# '?' == 0 or 1.
# An int is an explicit number of arguments to accept.
parser.add_argument('--nargs', nargs='+')
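Both variants, exercised by passing an argv list straight to `parse_args` (parser names are mine, not from the answer):

```python
import argparse

# nargs='+': one flag followed by many values.
p1 = argparse.ArgumentParser()
p1.add_argument('-l', '--list', nargs='+', required=True)
print(p1.parse_args(['-l', '1234', '2345']).list)  # ['1234', '2345']

# action='append': repeat the flag once per value.
p2 = argparse.ArgumentParser()
p2.add_argument('-l', '--list', action='append', required=True)
print(p2.parse_args(['-l', '1234', '-l', '2345']).list)  # ['1234', '2345']
```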
Related: a couple of days ago I used `nargs` to allow an empty value (explicitly passing `-o` without an argument, which becomes a `None`) while still providing a default value that's used if `-o` is omitted completely:
parser.add_argument(
    "--output-dir",
    "-o",
    help="Target directory for the converted .json files. (%(default)s)",
    type=Path,
    default=DEFAULT_OUTPUT_DIR,
    nargs="?",
)
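A sketch of the three resulting behaviors (`DEFAULT_OUTPUT_DIR` here is a placeholder, not the original value; one could also set `const=` to control what the bare flag produces):

```python
import argparse
from pathlib import Path

DEFAULT_OUTPUT_DIR = Path("out")  # placeholder, not the original value

parser = argparse.ArgumentParser()
parser.add_argument("--output-dir", "-o", type=Path,
                    default=DEFAULT_OUTPUT_DIR, nargs="?")

print(parser.parse_args([]).output_dir)           # out  (flag omitted -> default)
print(parser.parse_args(["-o"]).output_dir)       # None (bare flag -> const, unset here)
print(parser.parse_args(["-o", "x"]).output_dir)  # x
```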