Slurm blues
- Man pages for the commands: Slurm Workload Manager - Man Pages
- Slurm Workload Manager - Quick Start User Guide
- GitHub - NVIDIA/pyxis: Container plugin for Slurm Workload Manager
- Slurm Resource Manager | Research IT | Trinity College Dublin
Things that work for my specific instance:
- ssh-copy-id to log in via public key
- kitty +kitten ssh shamotskyi@v-slurm-login
- sshfs
- set -o vi in ~/.bashrc
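For convenience, the login bits can be wrapped in an ~/.ssh/config entry (a sketch; the alias is my choice, and the key path assumes whichever key ssh-copy-id installed):

```
Host slurm
    HostName v-slurm-login
    User shamotskyi
    # Assumption: the default key that ssh-copy-id copied over
    IdentityFile ~/.ssh/id_ed25519
```

After that, plain `ssh slurm` (or `kitty +kitten ssh slurm`) works without spelling out user and host.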
Problem: how do I install the Python packages my code needs?
- There’s no pip and I have no admin rights to install python3-ensurepip
- pyxis, which does “containers”, is there
Sample from the documentation on using pyxis:
srun --mem=16384 -c4 --gres=gpu:v100:2 \
--container-image tensorflow/tensorflow:latest-gpu \
--container-mounts=/slurm/$(id -u -n):/data \
--container-workdir /data \
python program.py
Sadly, my code needs some additional packages not installed by default there (or anywhere): I need to install spaCy language packs etc.
I have a Docker image I can use with everything installed on it, but it’s not on any public registry and I’m not going to set one up just for this.
Solution - Container that gets saved!
You can start interactive jobs, in this case inside a Docker container, and it drops you into a shell:
srun --mem=16384 -c4 --gres=gpu:v100:2 --container-image tensorflow/tensorflow:latest-gpu --container-mounts=/slurm/$(id -u -n):/data --container-workdir /data --pty bash
I couldn’t add users or install packages because nothing was writable, so I opened the documentation and found some interesting flags:
--container-image=[USER@][REGISTRY#]IMAGE[:TAG]|PATH
[pyxis] the image to use for the container
filesystem. Can be either a docker image given as
an enroot URI, or a path to a squashfs file on the
remote host filesystem.
--container-name=NAME [pyxis] name to use for saving and loading the
container on the host. Unnamed containers are
removed after the slurm task is complete; named
containers are not. If a container with this name
already exists, the existing container is used and
the import is skipped.
--container-save=PATH [pyxis] Save the container state to a squashfs
file on the remote host filesystem.
--container-writable [pyxis] make the container filesystem writable
--container-readonly [pyxis] make the container filesystem read-only
So, I can get an image from Docker Hub, save that container locally, and then provide that saved one instead of the image from the registry. Nice.
Or just give it a name, and it will reuse the named container instead of re-importing the image.
I can also make it writable.
=> I can create my own docker image, install everything there, and just go inside it to start trainings?
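The bootstrap step isn’t spelled out above, but presumably it looks something like this (a sketch under my assumptions; the function wrapper just keeps it from running on its own):

```shell
# Hypothetical first run: pull the base image from Docker Hub,
# give it a name, and save it as a local squashfs file that later
# runs can pass as --container-image.
bootstrap_container() {
    srun --mem=16384 -c4 --gres=gpu:v100:2 \
        --container-image tensorflow/tensorflow:latest-gpu \
        --container-name my_container_name \
        --container-save ./test_saved_path \
        --container-writable \
        --pty bash
}
```

Inside that shell you install everything once; every later run then starts from the saved/named container instead of the raw registry image.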
Final command:
srun --mem=16384 -c4 --gres=gpu:v100:2 --container-image ./test_saved_path --container-save ./test_saved_path_2 --container-mounts=/slurm/$(id -u -n)/data:/data --container-workdir /data --container-name my_container_name --container-writable --pty bash
It:
- Opens the container image locally, but more likely reopens the existing one under its name
- Opens a shell
- Is writable, so any changes I make get saved
- At the end, the container itself gets saved in ./test_saved_path_2, just in case the open-the-named-container-by-name trick ever fails me
- As a bonus, I can do stuff to make the container usable, instead of the very raw default settings of the server I have no rights to change
And a folder that I have mounted locally with sshfs, and that the container also has transparent access to, makes the entire workflow fast.
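The sshfs side is roughly this (a sketch; the local mountpoint is my choice, and the remote path mirrors the --container-mounts argument above):

```shell
# Hypothetical mount helper: expose the remote /slurm/<user>/data
# directory (which the container sees as /data) on my local machine.
mount_slurm_data() {
    mkdir -p ~/mnt/slurm-data
    sshfs shamotskyi@v-slurm-login:/slurm/shamotskyi/data ~/mnt/slurm-data
}
```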
The final solution was:
- Set up the perfect container based on the TF Docker image
- Create two scripts, one that just starts the training inside it and one that drops you in a shell in that container. Both based on the command above.
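Those two scripts might look roughly like this (a sketch; all names, including train.py, are placeholders, and the shared options are copied from the final command above):

```shell
# Options shared by both scripts, taken from the final srun command.
SRUN_OPTS="--mem=16384 -c4 --gres=gpu:v100:2 \
  --container-image ./test_saved_path \
  --container-save ./test_saved_path_2 \
  --container-mounts=/slurm/$(id -u -n)/data:/data \
  --container-workdir /data \
  --container-name my_container_name \
  --container-writable"

# Script 1: drop me into a shell inside the (writable, named) container.
container_shell() {
    srun $SRUN_OPTS --pty bash
}

# Script 2: start a training run inside the same container.
container_train() {
    srun $SRUN_OPTS python train.py "$@"   # train.py is a placeholder
}
```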
(But I still wonder how the rest are doing it, I can’t believe that’s the common way to run stuff that needs an installed package…)