Users can, of course, always just wrap the job itself in time (for example 
GNU /usr/bin/time -v, which reports the maximum resident set size) to record 
the maximum memory usage.  It's a bit of a naïve approach, but it does work.  
I agree that polling current usage is not very satisfactory.
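
As a rough illustration of the wrapping approach, here is a minimal Python 
sketch that runs a command and then reads its peak resident set size with 
getrusage, much as time does. The function name run_and_report_peak is made up 
for this example. On Linux ru_maxrss is reported in KiB and, as far as I 
understand it, reflects the largest single waited-for child process rather 
than the sum over a whole process tree, so treat the number as a lower bound 
for multi-process jobs.

```python
import resource
import subprocess
import sys


def run_and_report_peak(cmd):
    """Run cmd (a list of arguments), wait for it, and report its peak RSS.

    Returns (exit_code, peak_rss_kib). The name and the return shape are
    illustrative assumptions, not part of any Slurm tooling.
    """
    result = subprocess.run(cmd)
    # RUSAGE_CHILDREN covers children that have been waited for. ru_maxrss
    # is the largest value seen so far, so call this once per wrapper process.
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    print(f"peak RSS: {usage.ru_maxrss} KiB", file=sys.stderr)
    return result.returncode, usage.ru_maxrss
```

A job script could call this with the real command line, for example 
run_and_report_peak(["my_program", "--input", "data"]) where my_program is a 
placeholder, and pass the returned exit code on to the scheduler.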

Tim

--
Tim Cutts
Scientific Computing Platform Lead
AstraZeneca



From: greent10--- via slurm-users <slurm-users@lists.schedmd.com>
Date: Monday, 20 May 2024 at 12:10
To: Emyr James <emyr.ja...@crg.eu>, Davide DelVento <davide.quan...@gmail.com>
Cc: slurm-users@lists.schedmd.com <slurm-users@lists.schedmd.com>
Subject: [slurm-users] Re: memory high water mark reporting
Hi,

We have had similar questions from users about how best to find the peak 
memory usage of a job. They may run a job and get a not very useful value for 
sacct fields such as MaxRSS, because Slurm didn't poll while the job was at 
its maximum memory usage.

With cgroup v1, from looking online, it seems memory.max_usage_in_bytes takes 
caches into account, so it can vary with how much I/O is done, whilst 
total_rss in memory.stat looks more useful, maybe. Maybe memory.peak (in 
cgroup v2) is clearer?
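
To make the comparison concrete, here is a small Python sketch that reads the 
three values mentioned above from a cgroup directory. The function names and 
the report keys are invented for this example, and the directory to pass in 
depends on how the site mounts cgroups. Note that, as far as I know, total_rss 
in memory.stat is a current value rather than a high water mark, so it only 
helps if it is read while the job is still running.

```python
from pathlib import Path


def parse_memory_stat(text):
    """Parse the 'key value' lines of a cgroup memory.stat file into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def read_counter(cgroup_dir, name):
    """Return a single-integer cgroup file as int, or None if unavailable."""
    try:
        return int((Path(cgroup_dir) / name).read_text().strip())
    except (OSError, ValueError):
        return None


def peak_memory_report(cgroup_dir):
    """Collect whichever of the candidate values exist under cgroup_dir."""
    report = {
        # cgroup v1 high water mark; includes page cache
        "max_usage_in_bytes": read_counter(cgroup_dir, "memory.max_usage_in_bytes"),
        # cgroup v2 high water mark
        "peak": read_counter(cgroup_dir, "memory.peak"),
        # cgroup v1 current RSS of the hierarchy (not a peak)
        "total_rss": None,
    }
    try:
        stat_text = (Path(cgroup_dir) / "memory.stat").read_text()
    except OSError:
        return report
    report["total_rss"] = parse_memory_stat(stat_text).get("total_rss")
    return report
```

Files that do not exist on a given cgroup version simply come back as None, 
so the same function can be pointed at either a v1 or a v2 hierarchy.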

It's not clear from the documentation how a user should interpret the sacct 
values to infer the actual usage of jobs and correct their behaviour in future 
submissions.

I would be keen to see improvements in high water mark reporting.  I noticed 
that the jobacctgather plugin documentation was deleted back in Slurm 21.08. 
A SPANK plugin does possibly look like the way to go.  It also seems to be a 
common problem across technologies, e.g. 
https://github.com/google/cadvisor/issues/3286

Tom

From: Emyr James via slurm-users <slurm-users@lists.schedmd.com>
Date: Monday, 20 May 2024 at 10:50
To: Davide DelVento <davide.quan...@gmail.com>, Emyr James <emyr.ja...@crg.eu>
Cc: slurm-users@lists.schedmd.com <slurm-users@lists.schedmd.com>
Subject: [slurm-users] Re: memory high water mark reporting

Looking here:

https://slurm.schedmd.com/spank.html#SECTION_SPANK-PLUGINS

It looks like it's possible to hook something in at the right place using the 
slurm_spank_task_exit or slurm_spank_exit callbacks of a SPANK plugin. Does 
anyone have any experience or examples of doing this? Is there any more 
documentation available on this functionality?

Emyr James
Head of Scientific IT
CRG - Centre for Genomic Regulation

________________________________
From: Emyr James via slurm-users <slurm-users@lists.schedmd.com>
Sent: 17 May 2024 01:15
To: Davide DelVento <davide.quan...@gmail.com>
Cc: slurm-users@lists.schedmd.com <slurm-users@lists.schedmd.com>
Subject: [slurm-users] Re: memory high water mark reporting

Hi,

I have got a very simple LD_PRELOAD library that can do this. Maybe I should 
see if I can force slurmstepd to be run with that LD_PRELOAD and then see if 
that does it.

Ultimately I am trying to get all the useful accounting metrics into a 
ClickHouse database. If the LD_PRELOAD on slurmstepd seems to work, then I can 
expand it to insert the relevant row into the ClickHouse DB in the C code of 
the preload library.

But still... this seems like a very basic thing to do, and I am very surprised 
that it seems so difficult to do this with the standard accounting recording 
out of the box.

Emyr James
Head of Scientific IT
CRG - Centre for Genomic Regulation

________________________________
From: Davide DelVento <davide.quan...@gmail.com>
Sent: 17 May 2024 01:02
To: Emyr James <emyr.ja...@crg.eu>
Cc: slurm-users@lists.schedmd.com <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] memory high water mark reporting

Not exactly the answer to your question (which I don't know), but if you can 
prefix whatever is executed with https://github.com/NCAR/peak_memusage 
(which also uses getrusage) or a variant, you will be able to do that.

On Thu, May 16, 2024 at 4:10 PM Emyr James via slurm-users 
<slurm-users@lists.schedmd.com> wrote:
Hi,

We are trying out Slurm, having run Grid Engine for a long while.
In Grid Engine, the cgroup peak memory and max_rss are generated at the end of 
a job and recorded. It logs the information from the cgroup hierarchy and also 
makes a getrusage call right at the end on the parent PID of the whole job 
"container" before cleaning up.
With Slurm, it seems that the only way memory is recorded is by the acct 
gather polling. I am trying to add something to an epilog script to get 
memory.peak, but it looks like the cgroup hierarchy has already been destroyed 
by the time the epilog runs.
Where in the code is the cgroup hierarchy cleaned up? Is there no way to add 
something so that the accounting is updated during the job cleanup process and 
peak memory usage can be accurately logged?

I can reduce the polling interval from 30s to 5s, but I don't know whether 
this causes a lot of overhead. In any case, polling does not seem a sensible 
way to get values that should simply be determined right at the end by an 
event.
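
To show why a shorter interval does not fully solve this, here is a toy Python 
simulation of polling a memory trace. The trace is entirely hypothetical (one 
sample per second, values in GiB) and the function names are made up. It only 
illustrates the sampling problem and says nothing about how Slurm's accounting 
actually schedules its polls.

```python
def true_peak(trace):
    """High water mark over every one-second sample in the trace."""
    return max(trace)


def polled_peak(trace, interval, offset=0):
    """Peak seen when sampling every `interval` seconds from `offset`."""
    return max(trace[offset::interval], default=0)


# Hypothetical job: 120 s at 1 GiB with a 3-second spike to 8 GiB at t=41..43.
trace = [1] * 120
for t in range(41, 44):
    trace[t] = 8

print(true_peak(trace))                  # 8
print(polled_peak(trace, 30))            # 1 (samples at t=0, 30, 60, 90)
print(polled_peak(trace, 5))             # 1 (samples at t=40 and 45 straddle the spike)
print(polled_peak(trace, 5, offset=1))   # 8 (a sample happens to land on t=41)
```

Whether the 5s poll catches the spike depends only on where the samples happen 
to fall, which is the argument for recording the value once at job end.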

Many thanks,

Emyr

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com