[slurm-users] Re: memory high water mark reporting

Ryan Cox via slurm-users Mon, 20 May 2024 09:13:37 -0700

We have a pretty ugly patch that calls out to a script from common_cgroup_delete() in src/plugins/cgroup/common/cgroup_common.c. It checks that it's the job cgroup being deleted ("/job_*" as the path). The script collects the data and stores it elsewhere.
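
For readers who want to picture the path check described above, here is a minimal C sketch. It is not the actual patch: `is_job_cgroup_path` is a hypothetical helper, and it assumes the job cgroup is the last path component and is named `job_<jobid>` with a purely numeric id.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of Slurm): returns true when the last
 * component of `path` looks like "job_<digits>", i.e. the cgroup being
 * removed is the job-level one and not a step or task cgroup below it.
 */
bool is_job_cgroup_path(const char *path)
{
    if (path == NULL)
        return false;

    const char *last = strrchr(path, '/');
    if (last == NULL)
        return false;
    last++; /* skip the '/' itself */

    if (strncmp(last, "job_", 4) != 0)
        return false;

    const char *p = last + 4;
    if (*p == '\0')
        return false; /* "job_" with no id after it */

    for (; *p != '\0'; p++) {
        if (!isdigit((unsigned char)*p))
            return false;
    }
    return true;
}
```

A hook placed in the delete path could call a check like this first and only then read files such as memory.peak, while the directory still exists.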

It's a really ugly way of doing it and I wish there was something better. It seems like this could be a good spot for a SPANK hook.


Ryan

On 5/20/24 09:32, Emyr James via slurm-users wrote:

I changed the following in src/plugins/cgroup/v2/cgroup_v2.c

       if (common_cgroup_get_param(&task_cg_info->task_cg,
                                   "memory.current",
                                   &memory_current,
                                   &tmp_sz) != SLURM_SUCCESS) {
               if (task_id == task_special_id)
                       log_flag(CGROUP, "Cannot read task_special memory.peak file");
               else
                       log_flag(CGROUP, "Cannot read task %d memory.peak file",
                                task_id);
       }

to

       if (common_cgroup_get_param(&task_cg_info->task_cg,
                                   "memory.peak", /* changed from "memory.current" */
                                   &memory_current,
                                   &tmp_sz) != SLURM_SUCCESS) {
               if (task_id == task_special_id)
                       log_flag(CGROUP, "Cannot read task_special memory.peak file");
               else
                       log_flag(CGROUP, "Cannot read task %d memory.peak file",
                                task_id);
       }
and am using a polling interval of 5s. The values I get when adding this to the end of a batch script:
dir=$(awk -F: '{print $NF}' /proc/self/cgroup)
echo "[$(date +"%Y-%m-%d %H:%M:%S")] peak memory is $(cat "/sys/fs/cgroup$dir/memory.peak")"
echo "[$(date +"%Y-%m-%d %H:%M:%S")] finished on $(hostname)"
compared to what is in maxrss from sacct seem to be spot on for my test jobs at least. I guess this will do for now but it still feels very unsatisfactory to be using polling for this instead of having the code trigger the relevant stuff on job cleanup.
The downside of this "quick fix" is that now during a job run, sstat will report the max memory seen so far rather than the current usage. Personally I think this is not particularly useful anyway and if you really need to track memory usage as a job is running the LD_PRELOAD methods mentioned previously are better.
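
Outside of Slurm, reading the kernel's high-water mark is a single-value file read. The sketch below is a standalone illustration and not Slurm code (Slurm goes through common_cgroup_get_param, as quoted above). `read_cgroup_u64` is a hypothetical helper name, and memory.peak is only available on cgroup v2 with a sufficiently recent kernel (to my knowledge it was added in Linux 5.19).

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: read one unsigned integer from a cgroup v2
 * interface file such as "memory.peak" or "memory.current".
 * Returns 0 on success and -1 on any failure (missing file, bad content).
 */
int read_cgroup_u64(const char *cgroup_dir, const char *file, uint64_t *out)
{
    char path[4096];
    int n = snprintf(path, sizeof(path), "%s/%s", cgroup_dir, file);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1; /* path did not fit in the buffer */

    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return -1;

    /* These files hold a single decimal number followed by a newline. */
    int ok = (fscanf(fp, "%" SCNu64, out) == 1);
    fclose(fp);
    return ok ? 0 : -1;
}
```

Called with the directory built from /proc/self/cgroup (as in the batch script lines above) and "memory.peak", this returns the same number that the `cat` prints.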
Emyr James
Head of Scientific IT
CRG - Centre for Genomic Regulation

------------------------------------------------------------------------
*From:* Emyr James <[email protected]>
*Sent:* 20 May 2024 14:30
*To:* Thomas Green - Staff in University IT, Research Technologies / Staff Technoleg Gwybodaeth, Technolegau Ymchwil <[email protected]>; Davide DelVento <[email protected]>; Emyr James <[email protected]>
*Cc:* [email protected] <[email protected]>
*Subject:* Re: [slurm-users] Re: memory high water mark reporting
A bit more digging....
The cgroups stuff seems to be communicating back the values it finds in src/plugins/jobacct_gather/cgroup/jobacct_gather_cgroup.c
        prec->tres_data[TRES_ARRAY_MEM].size_read =
            cgroup_acct_data->total_rss;
I can't find anywhere in the code where it seems to be keeping track of the max value of total_rss seen so I can only conclude that it must be done in the database when slurmdbd puts in the values rather than being done in the slurm binaries themselves.
So this does seem to suggest that the peak value that is accounted at the end is just the maximum of the memory.current values that it sees over all the polls, even though there may be much higher transient values that may have occurred in between the polls which would be taken into account by memory.peak but slurm never sees these values.
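
To make the concern concrete, here is a small self-contained C sketch comparing the maximum seen by a poller with the true peak of a usage trace. It is a toy model, not Slurm code: `polled_max` and `true_peak` are hypothetical names, and the trace is an array with one sample per second.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Toy model of polling: look at usage[0], usage[interval],
 * usage[2 * interval], ... and return the largest value seen.
 * Returns 0 for an empty trace or an interval of 0.
 */
uint64_t polled_max(const uint64_t *usage, size_t n, size_t interval)
{
    uint64_t max = 0;

    if (interval == 0)
        return 0;
    for (size_t i = 0; i < n; i += interval) {
        if (usage[i] > max)
            max = usage[i];
    }
    return max;
}

/* What a kernel-side high-water mark would report: every sample counts. */
uint64_t true_peak(const uint64_t *usage, size_t n)
{
    return polled_max(usage, n, 1);
}
```

For a per-second trace of {100, 150, 200, 900, 250, 300, 320, 310, 305, 300, 290, 280} MB, polling every 5 samples only sees 100, 300 and 290 and reports 300, while the true peak is 900.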
Can anyone more familiar with the code than me corroborate this?
Presumably non-cgroup accounting has a similar issue? I.e. it polls rss and then the accounting db reports the highest seen even though using getrusage and checking ru_maxrss should be done too?
Many thanks,

Emyr James
Head of Scientific IT
CRG - Centre for Genomic Regulation

------------------------------------------------------------------------
*From:* Emyr James via slurm-users <[email protected]>
*Sent:* 20 May 2024 13:56
*To:* Thomas Green - Staff in University IT, Research Technologies / Staff Technoleg Gwybodaeth, Technolegau Ymchwil <[email protected]>; Davide DelVento <[email protected]>
*Cc:* [email protected] <[email protected]>
*Subject:* [slurm-users] Re: memory high water mark reporting
Siwmae Thomas,
I grepped for memory.peak in the source and it's not there. memory.current is there and is used in src/plugins/cgroup/v2/cgroup_v2.c
Adding the ability to get memory.peak in this source file seems to be something that should be done?
Should extern cgroup_acct_t *cgroup_p_task_get_acct_data(uint32_t task_id) be modified to include looking at memory.peak?
This may mean needing to modify the cgroup_acct_t struct in interfaces/cgroup.h to include it?
typedef struct {
      uint64_t usec;
      uint64_t ssec;
      uint64_t total_rss;
      uint64_t max_rss;   /* proposed new field */
      uint64_t total_pgmajfault;
      uint64_t total_vmem;
} cgroup_acct_t;
Presumably, with the polling method, it keeps looking at the current value and then keeps track of the max of these values. But the actual max may occur in between 2 polls so it would never see the true max value. At least by also reading memory.peak there is a chance to get closer to the real value with the polling method even if this is not optimal. Ideally it should run this during cleanup of tasks as well as at the poll interval.
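
A rough sketch of that idea, kept separate from the real Slurm types: `acct_sketch_t` and `acct_sketch_update` are hypothetical names, and the struct only mirrors the two fields relevant here. Each poll records the current value, keeps a running maximum, and raises that maximum to the kernel's memory.peak when it could be read.

```c
#include <stdint.h>

/* Hypothetical, cut-down stand-in for cgroup_acct_t (not the Slurm type). */
typedef struct {
    uint64_t total_rss; /* most recent memory.current sample */
    uint64_t max_rss;   /* best known high-water mark so far */
} acct_sketch_t;

/*
 * Update the accounting record from one poll (or from a final read at
 * task cleanup). `peak_valid` is nonzero when memory.peak was read
 * successfully; otherwise `peak` is ignored and only the running
 * maximum of the polled samples is kept.
 */
void acct_sketch_update(acct_sketch_t *acct, uint64_t current,
                        int peak_valid, uint64_t peak)
{
    acct->total_rss = current;

    if (current > acct->max_rss)
        acct->max_rss = current;
    if (peak_valid && peak > acct->max_rss)
        acct->max_rss = peak;
}
```

Keeping the two fields apart would let sstat keep showing current usage while the end-of-job record carries the high-water mark.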
As an aside, I also did a grep for getrusage and it doesn't seem to be used at all. I see that it is looking at /proc/%d/stat so maybe this is where it's getting the maxrss for non-cgroup accounting. Still, getrusage would seem to be the more obvious choice for this?
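
For comparison, this is roughly what an event-driven reading via getrusage looks like: the parent waits for a child and then asks the kernel for the peak RSS of its waited-for children. It is a standalone sketch, not Slurm code; `child_peak_rss_kb` is a hypothetical helper, and it assumes Linux, where ru_maxrss is reported in kilobytes.

```c
#include <stddef.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: fork a child that touches `bytes` of memory, wait
 * for it, then return ru_maxrss for waited-for children (kilobytes on
 * Linux). RUSAGE_CHILDREN reports the largest value over all children
 * reaped so far, so repeated calls never decrease. Returns -1 on error.
 */
long child_peak_rss_kb(size_t bytes)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: write one byte per 4 KiB so every page becomes resident.
         * The volatile pointer keeps the compiler from dropping the writes. */
        volatile char *buf = malloc(bytes);
        if (buf == NULL)
            _exit(1);
        for (size_t i = 0; i < bytes; i += 4096)
            buf[i] = 1;
        _exit(0);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;

    struct rusage ru;
    if (getrusage(RUSAGE_CHILDREN, &ru) != 0)
        return -1;
    return ru.ru_maxrss;
}
```

No polling is involved: the value is read once, after the child has exited, which is the pattern described for grid engine earlier in the thread.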
Emyr James
Head of Scientific IT
CRG - Centre for Genomic Regulation

------------------------------------------------------------------------
*From:* Thomas Green - Staff in University IT, Research Technologies / Staff Technoleg Gwybodaeth, Technolegau Ymchwil <[email protected]>
*Sent:* 20 May 2024 13:08
*To:* Emyr James <[email protected]>; Davide DelVento <[email protected]>
*Cc:* [email protected] <[email protected]>
*Subject:* Re: [slurm-users] Re: memory high water mark reporting

Hi,
We have had similar questions from users about how best to find the peak memory of a job, since they may run a job and get a not very useful value for sacct fields such as MaxRSS because Slurm didn't poll while the job was at its maximum memory usage.
With cgroup v1, from what I can find online, memory.max_usage_in_bytes takes caches into account, so it can vary with how much I/O is done, whilst total_rss in memory.stat looks more useful maybe. Maybe memory.peak is clearer?
It's not clear in the documentation how a user should use the sacct values to infer the actual usage of jobs and correct their behaviour in future submissions.
I would be keen to see improvements in high water mark reporting. I noticed that the jobacctgather plugin documentation was deleted back in Slurm 21.08. A SPANK plugin does possibly look like the way to go. Also it seems a common problem across technologies, e.g. https://github.com/google/cadvisor/issues/3286
Tom

*From: *Emyr James via slurm-users <[email protected]>
*Date: *Monday, 20 May 2024 at 10:50
*To: *Davide DelVento <[email protected]>, Emyr James <[email protected]>
*Cc: *[email protected] <[email protected]>
*Subject: *[slurm-users] Re: memory high water mark reporting
Looking here :
https://slurm.schedmd.com/spank.html#SECTION_SPANK-PLUGINS
It looks like it's possible to hook something in at the right place using the slurm_spank_task_exit or slurm_spank_exit callbacks of a SPANK plugin. Does anyone have any experience or examples of doing this? Is there any more documentation available on this functionality?
Emyr James
Head of Scientific IT
CRG - Centre for Genomic Regulation

------------------------------------------------------------------------

*From:* Emyr James via slurm-users <[email protected]>
*Sent:* 17 May 2024 01:15
*To:* Davide DelVento <[email protected]>
*Cc:* [email protected] <[email protected]>
*Subject:* [slurm-users] Re: memory high water mark reporting

Hi,
I have got a very simple LD_PRELOAD that can do this. Maybe I should see if I can force slurmstepd to be run with that LD_PRELOAD and then see if that does it.
Ultimately I am trying to get all the useful accounting metrics into a clickhouse database. If the LD_PRELOAD on slurmstepd seems to work then I can expand it to insert the relevant row into the clickhouse DB in the C code of the preload library.
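
For readers unfamiliar with the technique, a preload library of this kind can be very small. The following is a generic sketch and not the library mentioned above: the names `self_peak_rss_kb` and `report_peak_rss` are hypothetical, it relies on the GCC/Clang `destructor` attribute, and it assumes Linux (ru_maxrss in kilobytes).

```c
#include <stdio.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <unistd.h>

/* Peak RSS of the calling process in kilobytes (Linux), or -1 on error. */
long self_peak_rss_kb(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return ru.ru_maxrss;
}

/*
 * Runs when the process exits normally (return from main or exit()).
 * It does not run if the process is killed by a signal or calls _exit(),
 * so an OOM-killed task would not be reported this way.
 */
__attribute__((destructor))
static void report_peak_rss(void)
{
    fprintf(stderr, "[peakrss] pid=%ld maxrss_kb=%ld\n",
            (long)getpid(), self_peak_rss_kb());
}
```

Built with something like `gcc -shared -fPIC -o libpeakrss.so peakrss.c` and run as `LD_PRELOAD=./libpeakrss.so ./my_program`, it prints one line per process at exit. Writing the row to a database instead of stderr would go in the destructor.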
But still... this seems like a very basic thing to do and I am very surprised that it seems so difficult to do this with the standard accounting recording out of the box.
Emyr James
Head of Scientific IT
CRG - Centre for Genomic Regulation

------------------------------------------------------------------------

*From:* Davide DelVento <[email protected]>
*Sent:* 17 May 2024 01:02
*To:* Emyr James <[email protected]>
*Cc:* [email protected] <[email protected]>
*Subject:* Re: [slurm-users] memory high water mark reporting
Not exactly the answer to your question (which I don't know), but if you can prefix whatever is executed with https://github.com/NCAR/peak_memusage (which also uses getrusage) or a variant, you will be able to do that.
On Thu, May 16, 2024 at 4:10 PM Emyr James via slurm-users <[email protected]> wrote:
    Hi,

    We are trying out slurm having been running grid engine for a long
    while.

    In grid engine, the cgroups peak memory and max_rss are generated
    at the end of a job and recorded. It logs the information from the
    cgroup hierarchy as well as doing a getrusage call right at the
    end on the parent pid of the whole job "container" before cleaning up.

    With slurm it seems that the only way memory is recorded is by the
    acct gather polling. I am trying to add something in an epilog
    script to get the memory.peak but it looks like the cgroup
    hierarchy has been destroyed by the time the epilog is run.

    Where in the code is the cgroup hierarchy cleared up ? Is there no
    way to add something in so that the accounting is updated during
    the job cleanup process so that peak memory usage can be
    accurately logged ?

    I can reduce the polling interval from 30s to 5s but don't know if
    this causes a lot of overhead and in any case this seems to not be
    a sensible way to get values that should just be determined right
    at the end by an event rather than using polling.

    Many thanks,

    Emyr
    -- 
    slurm-users mailing list -- [email protected]
    To unsubscribe send an email to [email protected]


--
Ryan Cox
Director
Office of Research Computing
Brigham Young University

