Dear Daryl,
I once asked the same question and got a kind answer here in the forum a
while ago, so I am just passing it on roughly as I remember it.
RSS appears to double-count memory that is occupied by shared
libraries. It was suggested that I switch to PSS
https://slurm.schedmd.com/slurm.conf
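
To make the RSS vs. PSS difference concrete, here is a small sketch (my own illustration, not something from the Slurm docs) that sums the Rss and Pss fields from a process's /proc/<pid>/smaps on Linux. The function names are made up for this example. As far as I know, the slurm.conf setting in question is JobAcctGatherParams=UsePss, but please check the documentation for your version.

```python
def sum_smaps_fields(smaps_text):
    """Sum the Rss and Pss fields (in kB) over all mappings in smaps-style text."""
    totals = {"Rss": 0, "Pss": 0}
    for line in smaps_text.splitlines():
        parts = line.split()
        # Lines of interest look like "Rss:      1234 kB".
        # Related fields such as "Pss_Anon:" are deliberately skipped.
        if len(parts) >= 2 and parts[0].endswith(":"):
            key = parts[0][:-1]
            if key in totals:
                totals[key] += int(parts[1])
    return totals


def rss_and_pss(pid="self"):
    """Read /proc/<pid>/smaps (Linux only) and return the summed Rss and Pss in kB."""
    with open(f"/proc/{pid}/smaps") as fh:
        return sum_smaps_fields(fh.read())
```

Calling rss_and_pss() in a process that maps shared libraries should show Pss at or below Rss, because each shared page is charged to PSS only in proportion to the number of processes mapping it, while RSS counts it in full for every one of them.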
Hey All,
I was just hoping to find out if anyone can explain how a job running on a
single node was able to have a MaxRSS of 240% reported by seff. Below are some
specifics about the job that was run. We're using Slurm 19.05.7 on CentOS 8.2/
.
[root@hpc-node01 ~]# scontrol show jobid -dd 97036
J
Turns out on that new node I was running hwloc in a cgroup restricted
to cores 0-13 so that was causing the issue.
In an unrestricted cgroup shell, "hwloc-ls --only pu" works properly
and gives me the correct SLURM mapping.
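
In case it helps anyone else hitting this, a quick way to check whether the current shell or job is confined to a subset of CPUs is to compare the affinity mask against the total CPU count. This is only a sketch and assumes Linux and Python 3; the helper names are made up for this example.

```python
import os


def is_cpu_restricted(allowed_cpus, total_cpus):
    """Return True if the process may run on fewer CPUs than the machine has."""
    return len(allowed_cpus) < total_cpus


def describe_affinity():
    """Report the CPUs this process may use (Linux only)."""
    # os.sched_getaffinity(0) returns the CPU set of the calling process,
    # which reflects cpuset cgroup and taskset limits.
    allowed = sorted(os.sched_getaffinity(0))
    total = os.cpu_count()
    state = "restricted" if is_cpu_restricted(allowed, total) else "unrestricted"
    return f"{state}: {len(allowed)} of {total} CPUs allowed: {allowed}"
```

Running print(describe_affinity()) inside the restricted cgroup should report "restricted" and list only the allowed CPU numbers, which is a hint that topology tools run from that shell may also see a reduced view.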
-- Paul Raines (http://help.nmr.mgh.harvard.edu)
On Thu, 15 Dec 2022
Hmm…
That one is strange.
Can you try just hwloc-ls?
I wonder how slurmd would get that information if it is not hwloc-based.
Best
Marcus
Sent on the go.
> On 15.12.2022 at 16:00, Paul Raines wrote:
>
>
> Nice find!
>
> Unfortunately this does not work on the original box this
Nice find!
Unfortunately this does not work on the original box this whole
issue started on where I found the "alternating scheme"
# slurmd -C
NodeName=foobar CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16
ThreadsPerCore=2 RealMemory=256312
UpTime=5-14:55:31
# hwloc-ls --only pu
PU L#0
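
As a sanity check on the slurmd -C line above, the reported CPUs value should equal Boards x SocketsPerBoard x CoresPerSocket x ThreadsPerCore. A small sketch that parses the line and does the arithmetic (the function names are made up for this example, and the parsing only assumes whitespace-separated KEY=VALUE tokens):

```python
def parse_node_line(text):
    """Parse whitespace-separated KEY=VALUE tokens into a dict of strings."""
    fields = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def expected_cpus(fields):
    """Multiply out the topology counts to get the logical CPU count."""
    return (int(fields["Boards"]) * int(fields["SocketsPerBoard"])
            * int(fields["CoresPerSocket"]) * int(fields["ThreadsPerCore"]))


line = ("NodeName=foobar CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 "
        "ThreadsPerCore=2 RealMemory=256312")
fields = parse_node_line(line)
print(expected_cpus(fields), fields["CPUs"])  # prints: 64 64
```

So 1 x 2 x 16 x 2 = 64, which matches CPUs=64; a complete "hwloc-ls --only pu" listing on this box would therefore be expected to show 64 PUs.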
Marcus Wagner writes:
> That depends on what is meant with formatting argument.
Yes, they could surely have defined that.
> etc. And I would assume, that -S, -E and -T are filtering options, not
> formatting options.
I'd describe -T as a formatting option:
-T, --truncate