In the middle of the desert you can say anything you want

12 Jul 2022

Slurm jobs crash due to OOM

A training run that worked on my laptop gets killed on the Slurm node.

The sstat output was hard to parse and read, and I wasn’t sure which fields I wanted from it.

Find out the CPU time and memory usage of a slurm job - Stack Overflow

  • sstat is for running jobs, sacct is for finished jobs
  • the examples in the sacct docs showed me that column names are case-insensitive

Ended up with this:

 sacct -j 974 --format=jobid,jobname,maxvmsize,avevmsize,maxrss,averss,maxpages,avepages,avecpu,alloccpus,elapsed,state,exitcode,reqcpufreqmax,reqcpufreqgov,reqmem

For running jobs:

 sstat -j 975 --format=jobid,maxvmsize,avevmsize,maxrss,averss,maxpages,avepages,avecpu,reqcpufreqmax,reqcpufreqgov

(Half of these fields can be removed; my goal was just to get the output to fit on screen.)
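
Since the column output is hard to read by eye, here is a minimal Python sketch that asks sacct for pipe-delimited output via its --parsable2 flag and turns it into dictionaries. The names FIELDS, parse_sacct and sacct_rows are made up for this note, and the field selection is just a shortened version of the list above; treat it as a starting point, not a tested tool.

```python
import subprocess

# Hypothetical, shortened field list; any field names accepted by sacct work here.
FIELDS = ["jobid", "jobname", "maxrss", "maxvmsize", "elapsed", "state", "exitcode", "reqmem"]


def parse_sacct(text):
    """Turn pipe-delimited sacct output (header line first) into a list of dicts."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split("|")
    # Assumption: no field value contains a literal "|".
    return [dict(zip(header, line.split("|"))) for line in lines[1:]]


def sacct_rows(job_id):
    """Run sacct for one job and return one dict per job step."""
    result = subprocess.run(
        ["sacct", "-j", str(job_id), "--format=" + ",".join(FIELDS), "--parsable2"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_sacct(result.stdout)
```

Calling sacct_rows(974) on a machine with Slurm should give one dictionary per job step, keyed by whatever header names sacct prints, which makes it easy to pick out just the memory columns.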

W|A is still the best for conversions: 18081980K in gb - Wolfram|Alpha
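
For doing the same conversion offline, a small Python sketch follows. It assumes that the K/M/G suffixes Slurm prints are powers of 1024 (so 18081980K means 18081980 KiB), which is my understanding and worth double-checking; slurm_size_to_bytes and its base parameter are hypothetical names for this note.

```python
# Exponent for each size suffix; the base they are applied to is an assumption.
SUFFIXES = {"K": 1, "M": 2, "G": 3, "T": 4, "P": 5}


def slurm_size_to_bytes(value, base=1024):
    """Convert a size string such as '18081980K' to bytes.

    Assumption: suffixes are powers of 1024. Pass base=1000 if that is wrong.
    """
    value = value.strip()
    suffix = value[-1].upper()
    if suffix in SUFFIXES:
        return float(value[:-1]) * base ** SUFFIXES[suffix]
    return float(value)  # no suffix: treat the number as plain bytes


size = slurm_size_to_bytes("18081980K")
print(f"{size / 1024**3:.2f} GiB")  # 17.24 GiB
print(f"{size / 1000**3:.2f} GB")   # 18.52 GB
```

The two printed numbers differ only in whether a gigabyte is counted as 1024^3 or 1000^3 bytes, which matters when comparing against the requested memory.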

Other things I learned:

In the middle of the desert I can say anything I want.