Slurm jobs crash due to OOM
A training run that worked on my laptop gets killed on the Slurm node.
- Out-of-Memory (OOM) or Excessive Memory Usage | Ohio Supercomputer Center
- Allocating Memory | Princeton Research Computing
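As far as I can tell, both pages boil down to requesting enough memory up front in the batch script. A minimal sbatch sketch of that (the job name, values, and training command are made up):

```bash
#!/bin/bash
#SBATCH --job-name=train        # made-up name
#SBATCH --cpus-per-task=4       # made-up value
#SBATCH --mem=200G              # total memory for the job; suffixes like G work (see notes below)
#SBATCH --time=12:00:00

python train.py                 # stand-in for the actual training command
```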
`sstat` output was hard to parse and read; I wasn't sure what I wanted there.
Find out the CPU time and memory usage of a slurm job - Stack Overflow
`sstat` is for running jobs, `sacct` is for finished jobs. `sacct`'s own examples told me that column name capitalization doesn't matter.
Ended up with this:
sacct -j 974 --format=jobid,jobname,maxvmsize,avevmsize,maxrss,averss,maxpages,avepages,avecpu,alloccpus,elapsed,state,exitcode,reqcpufreqmax,reqcpufreqgov,reqmem
For running jobs:
sstat -j 975 --format=jobid,maxvmsize,avevmsize,maxrss,averss,maxpages,avepages,avecpu,reqcpufreqmax,reqcpufreqgov
(Half of these can be removed, but my goal was just to get it to fit on screen)
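A trimmed-down variant that would probably be enough for just the OOM question (compare MaxRSS against ReqMem and check State/ExitCode), same made-up job ID:

```bash
sacct -j 974 --format=JobID,JobName,State,ExitCode,ReqMem,MaxRSS,Elapsed
```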
W|A is still the best for conversions: 18081980K in gb - Wolfram|Alpha
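The same conversion can also be done locally; a one-liner, assuming the K suffix means KiB here (which is how Slurm usually reports MaxRSS):

```bash
# 18081980 KiB -> GiB; prints 17.24 GiB
awk 'BEGIN { printf "%.2f GiB\n", 18081980 / 1024 / 1024 }'
```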
Other things I learned:
- You can use suffixes in args like `--mem=200G`
- `--mem=0` should give access to all the memory, though it doesn't work for me
- You can do a task farm to run many instances of the same command with diff params (see the sketch after this list): Slurm task-farming for Python scripts | Research IT | Trinity College Dublin
- Found more helpful places
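The Trinity page describes one approach; a plain-Slurm way to get a similar effect is a job array, where each task picks its own parameter via `$SLURM_ARRAY_TASK_ID`. A rough sketch with made-up parameters:

```bash
#!/bin/bash
#SBATCH --job-name=farm          # made-up name
#SBATCH --array=0-3              # four array tasks, indices 0..3
#SBATCH --mem=4G                 # made-up per-task memory

# Made-up parameter list; each array task grabs one entry by its index.
PARAMS=(0.1 0.01 0.001 0.0001)
LR=${PARAMS[$SLURM_ARRAY_TASK_ID]}

python train.py --lr "$LR"       # stand-in for the actual command
```

Submitted once with `sbatch`, Slurm then runs each array element as its own job with `SLURM_ARRAY_TASK_ID` set.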