Slurm jobs crash due to OOM
A training run that worked on my laptop gets killed on the Slurm node.
- Out-of-Memory (OOM) or Excessive Memory Usage | Ohio Supercomputer Center
- Allocating Memory | Princeton Research Computing
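As far as I can tell, both pages boil down to requesting enough memory up front in the batch script. A minimal sbatch sketch of that (the job name, values, and training command are made up):

```bash
#!/bin/bash
#SBATCH --job-name=train        # made-up name
#SBATCH --cpus-per-task=4       # made-up value
#SBATCH --mem=200G              # total memory for the job; suffixes like G work (see notes below)
#SBATCH --time=12:00:00

python train.py                 # stand-in for the actual training command
```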
`sstat` output was hard to parse and read; I wasn't sure what I wanted there.
Find out the CPU time and memory usage of a slurm job - Stack Overflow
`sstat` is for running jobs, `sacct` is for finished jobs. `sacct`'s own examples told me that column name capitalization doesn't matter.
Ended up with this:
sacct -j 974 --format=jobid,jobname,maxvmsize,avevmsize,maxrss,averss,maxpages,avepages,avecpu,alloccpus,elapsed,state,exitcode,reqcpufreqmax,reqcpufreqgov,reqmem
For running jobs:
sstat -j 975 --format=jobid,maxvmsize,avevmsize,maxrss,averss,maxpages,avepages,avecpu,reqcpufreqmax,reqcpufreqgov
(Half of these can be removed, but my goal was just to get it to fit on screen)
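A trimmed-down variant that would probably be enough for just the OOM question (compare MaxRSS against ReqMem and check State/ExitCode), same made-up job ID:

```bash
sacct -j 974 --format=JobID,JobName,State,ExitCode,ReqMem,MaxRSS,Elapsed
```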
W|A is still the best for conversions: 18081980K in gb - Wolfram|Alpha
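The same conversion can also be done locally; a one-liner, assuming the K suffix means KiB here (which is how Slurm usually reports MaxRSS):

```bash
# 18081980 KiB -> GiB; prints 17.24 GiB
awk 'BEGIN { printf "%.2f GiB\n", 18081980 / 1024 / 1024 }'
```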
Other things I learned:
- You can use suffixes in args like `--mem=200G`
- `--mem=0` should give access to all the memory, though it doesn't work for me
- You can do a task farm to run many instances of the same command with diff params (see the sketch after this list): Slurm task-farming for Python scripts | Research IT | Trinity College Dublin
- Found more helpful places
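The Trinity page describes one approach; a plain-Slurm way to get a similar effect is a job array, where each task picks its own parameter via `$SLURM_ARRAY_TASK_ID`. A rough sketch with made-up parameters:

```bash
#!/bin/bash
#SBATCH --job-name=farm          # made-up name
#SBATCH --array=0-3              # four array tasks, indices 0..3
#SBATCH --mem=4G                 # made-up per-task memory

# Made-up parameter list; each array task grabs one entry by its index.
PARAMS=(0.1 0.01 0.001 0.0001)
LR=${PARAMS[$SLURM_ARRAY_TASK_ID]}

python train.py --lr "$LR"       # stand-in for the actual command
```

Submitted once with `sbatch`, Slurm then runs each array element as its own job with `SLURM_ARRAY_TASK_ID` set.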