Hi All,

I am having trouble determining the real RSS memory usage of some of our
users' jobs; sacct returns numbers that look wrong.

Rocky Linux release 8.5, Slurm 21.08

(slurm.conf)
ProctrackType=proctrack/cgroup
JobAcctGatherType=jobacct_gather/linux

The problematic jobs look like this:

1. Python spawns 96 threads with the threading module;

2. each thread calls scikit-learn, which in turn spawns 96 more threads via OpenMP.

This obviously oversubscribes the node, and I want to address it.
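
For reference, the job shape is roughly the following. This is only a
minimal sketch: the nested threading-over-OpenMP pattern is what the
users describe, but the Ridge workload and array sizes are placeholders
I made up for illustration.

    import threading

    import numpy as np
    from sklearn.linear_model import Ridge

    def work():
        # inside each call, scikit-learn/OpenMP/BLAS may again use
        # every core it can see on the node
        X = np.random.rand(2000, 200)
        y = np.random.rand(2000)
        Ridge().fit(X, y)

    # 96 Python threads, each of which can fan out again via OpenMP
    threads = [threading.Thread(target=work) for _ in range(96)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()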

The node has 300GB of RAM, but sacct (and seff) report 1.2TB MaxRSS
(and the same AveRSS). That cannot be correct.


I suspect that Slurm with jobacct_gather/linux sums the RSS of every
task in the job, so pages shared between these threads/processes get
counted many times over.
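
That failure mode is easy to reproduce outside Slurm. The sketch below
is my own illustration, not the plugin's code, and it uses forked
processes as a stand-in for whatever the job actually spawns: it
allocates one ~400MB array, forks four children that only read it, and
then naively sums per-process VmRSS from /proc. The naive sum comes out
near 2GB even though physical usage stays near 400MB, because the
shared copy-on-write pages are counted once per process:

    import multiprocessing as mp
    import os
    import time

    import numpy as np

    data = np.zeros(50 * 1024 * 1024)  # ~400MB; shared copy-on-write after fork

    def child():
        data.sum()       # read-only touch: pages stay shared with the parent
        time.sleep(10)   # stay alive long enough to be measured

    def vmrss_kb(pid):
        # resident set size as the kernel reports it, shared pages included
        with open(f"/proc/{pid}/status") as f:
            for line in f:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1])
        return 0

    if __name__ == "__main__":
        mp.set_start_method("fork")  # Linux default; children share pages
        procs = [mp.Process(target=child) for _ in range(4)]
        for p in procs:
            p.start()
        time.sleep(3)  # give the children time to touch the array
        naive = vmrss_kb(os.getpid()) + sum(vmrss_kb(p.pid) for p in procs)
        print(f"naive per-process RSS sum: {naive / 1024 / 1024:.1f} GB")
        for p in procs:
            p.join()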

Maybe the OpenMP part is fine for Slurm, while Python threading does
not work well with Slurm's memory accounting?

So, if this is the case, would the real MaxRSS be roughly 1.2TB / 96 ≈ 12.5GB?

I want to get the right MaxRSS to report to users.
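
As a cross-check of what such a job really uses, I am thinking of
summing PSS (proportional set size, which splits each shared page among
its sharers) from /proc/<pid>/smaps_rollup over the job's processes,
instead of raw RSS. A rough sketch, assuming the pids come from
something like "scontrol listpids <jobid>" on the compute node:

    def pss_kb(pid):
        # Pss charges each shared page 1/N to each of its N users, so
        # summing it across processes does not multiply-count shared memory
        with open(f"/proc/{pid}/smaps_rollup") as f:
            for line in f:
                if line.startswith("Pss:"):
                    return int(line.split()[1])
        return 0

    def job_pss_gb(pids):
        return sum(pss_kb(pid) for pid in pids) / 1024 / 1024

    # pids = [...]  # e.g. parsed from `scontrol listpids <jobid>`
    # print(f"job PSS: {job_pss_gb(pids):.1f} GB")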

Thanks!

Best,

Feng

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
