[slurm-users] Re: memory high water mark reporting

2024-05-20 Thread Emyr James via slurm-users
Looking here:

https://slurm.schedmd.com/spank.html#SECTION_SPANK-PLUGINS

It looks like it's possible to hook something in at the right place using the 
slurm_spank_task_exit or slurm_spank_exit callbacks. Does anyone have any 
experience or examples of doing this? Is there any more documentation 
available on this functionality?
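
For anyone exploring this route, below is a rough, untested sketch of the shape such a plugin could take. It uses slurm_spank_exit() in the remote (slurmstepd) context and plain getrusage() on the reaped tasks rather than reading cgroup files; the plugin name, log format and build/plugstack lines are illustrative assumptions, not an existing plugin.

/*
 * spank_peak_mem.c - hedged sketch only, not a tested implementation.
 * Logs the peak RSS of the step's reaped child tasks from
 * slurm_spank_exit(), which runs in slurmstepd just before it exits.
 * Note: ru_maxrss is per-process (the largest single child, in kB on
 * Linux), not a sum across tasks.
 *
 * Build (roughly): gcc -shared -fPIC -o spank_peak_mem.so spank_peak_mem.c
 * then reference it from plugstack.conf, e.g.
 *   optional /usr/lib64/slurm/spank_peak_mem.so
 */
#include <stdint.h>
#include <sys/resource.h>
#include <slurm/spank.h>

SPANK_PLUGIN(spank_peak_mem, 1);

int slurm_spank_exit(spank_t sp, int ac, char **av)
{
	struct rusage ru;
	uint32_t job_id = 0, step_id = 0;

	if (spank_context() != S_CTX_REMOTE)	/* only inside slurmstepd */
		return ESPANK_SUCCESS;

	spank_get_item(sp, S_JOB_ID, &job_id);
	spank_get_item(sp, S_JOB_STEPID, &step_id);

	if (getrusage(RUSAGE_CHILDREN, &ru) == 0)
		slurm_info("job %u step %u: child peak RSS %ld kB",
			   job_id, step_id, ru.ru_maxrss);

	return ESPANK_SUCCESS;
}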

Emyr James
Head of Scientific IT
CRG - Centre for Genomic Regulation


From: Emyr James via slurm-users 
Sent: 17 May 2024 01:15
To: Davide DelVento 
Cc: slurm-users@lists.schedmd.com 
Subject: [slurm-users] Re: memory high water mark reporting

Hi,

I have got a very simple LD_PRELOAD library that can do this. Maybe I should see 
if I can force slurmstepd to run with that LD_PRELOAD and then see if that does 
it.

Ultimately I am trying to get all the useful accounting metrics into a ClickHouse 
database. If the LD_PRELOAD on slurmstepd works, then I can expand it to 
insert the relevant row into the ClickHouse DB in the C code of the preload 
library.
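
For context, the kind of preload library described here can be tiny. Below is a hedged sketch (not the actual library), with a destructor that reports ru_maxrss at process exit; the ClickHouse insert would go where the fprintf is, and the file and function names are just placeholders.

/*
 * peak_rss_preload.c - illustrative sketch of an LD_PRELOAD library that
 * reports peak RSS when the preloaded process exits.
 * Build: gcc -shared -fPIC -o peak_rss_preload.so peak_rss_preload.c
 * Use:   LD_PRELOAD=./peak_rss_preload.so ./my_program
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/resource.h>

__attribute__((destructor))
static void report_peak_rss(void)
{
	struct rusage self, children;
	const char *jobid = getenv("SLURM_JOB_ID");

	if (getrusage(RUSAGE_SELF, &self) == 0 &&
	    getrusage(RUSAGE_CHILDREN, &children) == 0) {
		/* ru_maxrss is reported in kilobytes on Linux. */
		fprintf(stderr,
			"[peak_rss] pid %d job %s: self %ld kB, children %ld kB\n",
			getpid(), jobid ? jobid : "n/a",
			self.ru_maxrss, children.ru_maxrss);
	}
}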

But still... this seems like a very basic thing to do, and I am very surprised 
that it seems so difficult with the standard accounting recording out of 
the box.

Emyr James
Head of Scientific IT
CRG - Centre for Genomic Regulation


From: Davide DelVento 
Sent: 17 May 2024 01:02
To: Emyr James 
Cc: slurm-users@lists.schedmd.com 
Subject: Re: [slurm-users] memory high water mark reporting

Not exactly the answer to your question (which I don't know), but if you can 
prefix whatever is executed with 
https://github.com/NCAR/peak_memusage
(which also uses getrusage) or a variant, you will be able to do that.

On Thu, May 16, 2024 at 4:10 PM Emyr James via slurm-users 
<slurm-users@lists.schedmd.com> wrote:
Hi,

We are trying out Slurm, having been running Grid Engine for a long while.
In Grid Engine, the cgroup peak memory and max_rss are generated at the end of 
a job and recorded. It logs the information from the cgroup hierarchy as well 
as doing a getrusage call right at the end on the parent PID of the whole job 
"container" before cleaning up.
With Slurm it seems that the only way memory is recorded is by the acct_gather 
polling. I am trying to add something in an epilog script to get 
memory.peak, but it looks like the cgroup hierarchy has been destroyed by the 
time the epilog is run.
Where in the code is the cgroup hierarchy cleaned up? Is there no way to add 
something so that the accounting is updated during the job cleanup process, 
so that peak memory usage can be accurately logged?

I can reduce the polling interval from 30 s to 5 s, but I don't know whether this 
causes a lot of overhead, and in any case it does not seem a sensible way to get 
values that should be determined right at the end by an event rather than 
by polling.

Many thanks,

Emyr


[slurm-users] Re: memory high water mark reporting

2024-05-20 Thread greent10--- via slurm-users
Hi,

We have had similar questions from users about how best to find out the 
peak memory of a job, since they may run a job and get a not very useful 
value for fields in sacct such as MaxRSS, because Slurm didn't happen to poll 
at the point of maximum memory usage.

With cgroup v1, looking online, it seems memory.max_usage_in_bytes takes the 
page cache into account, so it can vary with how much I/O is done, whilst 
total_rss in memory.stat looks more useful. Maybe cgroup v2's memory.peak is clearer?
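
For reference, these counters live in the following files; the paths are illustrative, since the actual job/step cgroup path on a Slurm node depends on the cgroup plugin and OS setup:

# cgroup v1: high-water mark, includes page cache
cat /sys/fs/cgroup/memory/<job-cgroup>/memory.max_usage_in_bytes
# cgroup v1: anonymous memory, from memory.stat
grep '^total_rss ' /sys/fs/cgroup/memory/<job-cgroup>/memory.stat
# cgroup v2: high-water mark (on kernels that provide memory.peak)
cat /sys/fs/cgroup/<job-cgroup>/memory.peak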

It's not clear in the documentation how a user should read the sacct values to 
infer the actual usage of their jobs and correct their behaviour in future submissions.

I would be keen to see improvements in high water mark reporting. I noticed 
that the jobacct_gather plugin documentation was deleted back in Slurm 21.08 – 
a SPANK plugin does possibly look like the way to go. It also seems to be a common 
problem across technologies, e.g. https://github.com/google/cadvisor/issues/3286

Tom


[slurm-users] Re: memory high water mark reporting

2024-05-20 Thread Emyr James via slurm-users
Siwmae Thomas,

I grepped for memory.peak in the source and it's not there. memory.current is 
there and is used in src/plugins/cgroup/v2/cgroup_v2.c.

Adding the ability to read memory.peak in this source file seems to be something 
that should be done?

Should extern cgroup_acct_t *cgroup_p_task_get_acct_data(uint32_t task_id) be 
modified to also look at memory.peak?

This may mean modifying the cgroup_acct_t struct in interfaces/cgroup.h to 
include it?

typedef struct {
	uint64_t usec;
	uint64_t ssec;
	uint64_t total_rss;
	uint64_t max_rss;	/* proposed new field for the memory.peak value */
	uint64_t total_pgmajfault;
	uint64_t total_vmem;
} cgroup_acct_t;

Presumably, with the polling method, it keeps looking at the current value and 
keeps track of the maximum of those values. But the actual maximum may occur 
between two polls, so it would never see the true peak. At least by also 
reading memory.peak there is a chance to get closer to the real value with the 
polling method, even if this is not optimal. Ideally it should also read it during 
cleanup of tasks, as well as at each poll interval.

As an aside, I also did a grep for getrusage and it doesn't seem to be used at 
all. I see that the code looks at /proc/%d/stat, so maybe that is where it is 
getting the max RSS for non-cgroup accounting. Still, getrusage would seem to be 
the more obvious choice for this?
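
For discussion, here is a hedged sketch of roughly what that read could look like inside cgroup_v2.c, reusing common_cgroup_get_param() just as the existing memory.current code does; the helper name, the xcgroup_t parameter and the idea of storing the result in a new max_rss field are assumptions rather than existing Slurm code:

/*
 * Sketch only: read memory.peak for a task cgroup and return it in bytes
 * (0 if unreadable).  The caller in cgroup_p_task_get_acct_data() would
 * store the result in the proposed max_rss field of cgroup_acct_t.
 */
static uint64_t _get_task_memory_peak(xcgroup_t *task_cg, uint32_t task_id)
{
	char *buf = NULL;
	size_t sz = 0;
	uint64_t peak = 0;

	if (common_cgroup_get_param(task_cg, "memory.peak",
				    &buf, &sz) == SLURM_SUCCESS) {
		peak = strtoull(buf, NULL, 10);	/* needs <stdlib.h> */
		xfree(buf);
	} else {
		log_flag(CGROUP, "Cannot read task %d memory.peak file",
			 task_id);
	}
	return peak;
}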

Emyr James
Head of Scientific IT
CRG - Centre for Genomic Regulation


[slurm-users] Re: memory high water mark reporting

2024-05-20 Thread Emyr James via slurm-users
A bit more digging:

The cgroup accounting code seems to communicate the values it gathers back in 
src/plugins/jobacct_gather/cgroup/jobacct_gather_cgroup.c:

	prec->tres_data[TRES_ARRAY_MEM].size_read =
		cgroup_acct_data->total_rss;

I can't find anywhere in the code where it keeps track of the maximum value of 
total_rss seen, so I can only conclude that it must be done in the database 
when slurmdbd inserts the values, rather than in the Slurm binaries themselves.

So this does seem to suggest that the peak value accounted at the end is just 
the maximum of the memory.current values seen across all the polls, even though 
there may be much higher transient values that occurred between polls, which 
would be captured by memory.peak but which Slurm never sees.

Can anyone more familiar with the code than me corroborate this?

Presumably non-cgroup accounting has a similar issue? I.e. it polls RSS and 
the accounting DB reports the highest value seen, even though getrusage and 
ru_maxrss should be checked too?
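
As a quick way to see the effect in isolation (nothing Slurm-specific, purely an illustration): the toy program below holds a short-lived 1 GiB allocation, which a coarse poll of current usage will usually miss during the long idle tail, while ru_maxrss (and cgroup v2's memory.peak) still record it.

/* spike.c - toy illustration of a transient memory peak. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/resource.h>

int main(void)
{
	size_t spike = 1UL << 30;		/* 1 GiB, held for ~1 second */
	char *buf = malloc(spike);
	struct rusage ru;

	if (buf) {
		memset(buf, 1, spike);		/* actually touch the pages */
		sleep(1);
		free(buf);			/* RSS drops back down here */
	}
	sleep(10);				/* long low-memory tail */

	getrusage(RUSAGE_SELF, &ru);
	printf("ru_maxrss = %ld kB\n", ru.ru_maxrss);	/* still ~1 GiB */
	return 0;
}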

Many thanks,

Emyr James
Head of Scientific IT
CRG - Centre for Genomic Regulation


[slurm-users] Re: memory high water mark reporting

2024-05-20 Thread Emyr James via slurm-users

I changed the following in src/plugins/cgroup/v2/cgroup_v2.c:

	if (common_cgroup_get_param(&task_cg_info->task_cg,
				    "memory.current",
				    &memory_current,
				    &tmp_sz) != SLURM_SUCCESS) {
		if (task_id == task_special_id)
			log_flag(CGROUP, "Cannot read task_special memory.peak file");
		else
			log_flag(CGROUP, "Cannot read task %d memory.peak file",
				 task_id);
	}

to

	if (common_cgroup_get_param(&task_cg_info->task_cg,
				    "memory.peak",
				    &memory_current,
				    &tmp_sz) != SLURM_SUCCESS) {
		if (task_id == task_special_id)
			log_flag(CGROUP, "Cannot read task_special memory.peak file");
		else
			log_flag(CGROUP, "Cannot read task %d memory.peak file",
				 task_id);
	}

and am using a polling interval of 5 s. The values I get when adding this to the 
end of a batch script:

dir=$(awk -F: '{print $NF}' /proc/self/cgroup)
echo [$(date +"%Y-%m-%d %H:%M:%S")] peak memory is `cat /sys/fs/cgroup$dir/memory.peak`
echo [$(date +"%Y-%m-%d %H:%M:%S")] finished on $(hostname)

compared to what is in MaxRSS from sacct seem to be spot on, for my test jobs at 
least. I guess this will do for now, but it still feels very unsatisfactory to 
be using polling for this instead of having the code trigger the relevant read 
on job cleanup.

The downside of this "quick fix" is that now, during a job run, sstat will 
report the maximum memory seen so far rather than the current usage. Personally I 
think that is not particularly useful anyway, and if you really need to track 
memory usage while a job is running, the LD_PRELOAD methods mentioned previously 
are better.

Emyr James
Head of Scientific IT
CRG - Centre for Genomic Regulation


[slurm-users] Re: memory high water mark reporting

2024-05-20 Thread Ryan Cox via slurm-users
We have a pretty ugly patch that calls out to a script from 
common_cgroup_delete() in src/plugins/cgroup/common/cgroup_common.c.  It 
checks that it's the job cgroup being deleted ("/job_*" as the path).  
The script collects the data and stores it elsewhere.


It's a really ugly way of doing it and I wish there was something 
better.  It seems like this could be a good spot for a SPANK hook.
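
For illustration only (this is not the actual patch): the hook could look something like the sketch below, placed where common_cgroup_delete() is about to remove the cgroup; the script path is hypothetical and the check mirrors the "/job_*" test described above.

/*
 * Hedged sketch, not the real patch: hand the job cgroup path to an
 * external collection script before the cgroup is removed.
 * /usr/local/sbin/collect_peak_mem.sh is a hypothetical script name.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void _run_cgroup_exit_hook(const char *cgroup_path)
{
	char cmd[4096];

	/* Only fire when a job-level cgroup ("/job_<id>") is being deleted. */
	if (!cgroup_path || !strstr(cgroup_path, "/job_"))
		return;

	snprintf(cmd, sizeof(cmd),
		 "/usr/local/sbin/collect_peak_mem.sh '%s' >/dev/null 2>&1 &",
		 cgroup_path);
	(void) system(cmd);	/* best effort: must not block cgroup cleanup */
}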


Ryan


[slurm-users] Apply an specific QoS to all users that belongs to an specific account

2024-05-20 Thread Gestió Servidors via slurm-users
Hi,

I would like to know if it is possible to apply a specific QoS to all users 
that belong to a specific account. For example, I have created some new users 
"user_XX" and also created their new account in Slurm with "sacctmgr 
create account name=Test" and "sacctmgr create user name=user_XX 
DefaultAccount=Test". After that, I changed the default QoS "normal" to a new 
QoS "minimal" (with some limits) on account "Test" (sacctmgr modify account 
where name=Test set qos=minimal), but what I have seen is that the users "user_XX" 
that belong to the "Test" account stay on QoS "normal" (the default QoS), so it 
seems that the users have not inherited the QoS applied to their account.
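
For reference, the sequence described above, plus a query to see which QoS the associations actually carry, is roughly the following (account/user names as in the example; this is not a verified fix):

sacctmgr create account name=Test
sacctmgr create user name=user_XX DefaultAccount=Test
sacctmgr modify account where name=Test set qos=minimal

# check what the user associations actually carry:
sacctmgr show assoc where account=Test format=Account,User,QOS,DefaultQOS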

Is there any way to do this, or am I doing something wrong?

Thanks.


[slurm-users] Re: memory high water mark reporting

2024-05-20 Thread greent10--- via slurm-users
Hi,

I came to the same conclusion and spotted similar bits of the code that could 
be changed to get what is required. Without a new variable it will be tricky 
to implement properly, due to the way the existing variables are used and defined. 
Maybe a PeakMem field in the Slurm accounting database is required to capture 
this, if there is enough interest in the feature.

N.B. I got confused over the memory fields earlier – total_rss is already used; 
max_usage_in_bytes in cgroup v1 is the only high-water-mark counter (similar to 
memory.peak in cgroup v2).

Maybe the only proper way is to monitor this sort of thing outside of Slurm, 
with tools such as Open XDMoD.

Tom


[slurm-users] Invalid/incorrect gres.conf syntax

2024-05-20 Thread Gestió Servidors via slurm-users
Hello,

I have configured my "gres.conf" in this way:
NodeName=node-gpu-1 AutoDetect=off Name=gpu Type=GeForceRTX2070 
File=/dev/nvidia0 Cores=0-11
NodeName=node-gpu-1 AutoDetect=off Name=gpu Type=GeForceGTX1080Ti 
File=/dev/nvidia1 Cores=12-23
NodeName=node-gpu-2 AutoDetect=off Name=gpu Type=GeForceGTX1080Ti 
File=/dev/nvidia0 Cores=0-11
NodeName=node-gpu-2 AutoDetect=off Name=gpu Type=GeForceGTX1080 
File=/dev/nvidia1 Cores=12-23
NodeName=node-gpu-3 AutoDetect=off Name=gpu Type=GeForceRTX3080 
File=/dev/nvidia0 Cores=0-11
NodeName=node-gpu-4 AutoDetect=off Name=gpu Type=GeForceRTX3080 
File=/dev/nvidia0 Cores=0-7

node-gpu-1 and node-gpu-2 are two systems with two sockets; node-gpu-3 and 
node-gpu-4 have only one socket.


In my "slurm.conf" I have these lines:
AccountingStorageTRES=gres/gpu
SelectType=select/cons_tres
GresTypes=gpu
NodeName=node-gpu-1 CPUs=24 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=2 
RealMemory=96000 TmpDisk=47000 Gres=gpu:GeForceRTX2070:1,gpu:GeForceGTX1080Ti:1
NodeName=node-gpu-2 CPUs=24 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=2 
RealMemory=96000 TmpDisk=47000 Gres=gpu:GeForceGTX1080Ti:1,gpu:GeForceGTX1080:1
NodeName=node-gpu-3 CPUs=12 SocketsPerBoard=1 CoresPerSocket=6 ThreadsPerCore=2 
RealMemory=23000 Gres=gpu:GeForceRTX3080:1
NodeName=node-gpu-4 CPUs=8 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=2 
RealMemory=7800 Gres=gpu:GeForceRTX3080:1

However, slurmctld logs warnings about an "error syntax in Cores attribute" 
in gres.conf.

Where is the syntax error?

Thanks a lot!


[slurm-users] Problems with gres.conf

2024-05-20 Thread Gestió Servidors via slurm-users
Hello,

I am trying to rewrite my gres.conf file.

Before changes, this file was just like this:
NodeName=node-gpu-1 AutoDetect=off Name=gpu Type=GeForceRTX2070 
File=/dev/nvidia0 Cores=0-11
NodeName=node-gpu-1 AutoDetect=off Name=gpu Type=GeForceGTX1080Ti 
File=/dev/nvidia1 Cores=12-23
NodeName=node-gpu-2 AutoDetect=off Name=gpu Type=GeForceGTX1080Ti 
File=/dev/nvidia0 Cores=0-11
NodeName=node-gpu-2 AutoDetect=off Name=gpu Type=GeForceGTX1080 
File=/dev/nvidia1 Cores=12-23
NodeName=node-gpu-3 AutoDetect=off Name=gpu Type=GeForceRTX3080 
File=/dev/nvidia0 Cores=0-11
NodeName=node-gpu-4 AutoDetect=off Name=gpu Type=GeForceRTX3080 
File=/dev/nvidia0 Cores=0-7
# you can see that nodes node-gpu-1 and node-gpu-2 have two GPUs each, 
whereas nodes node-gpu-3 and node-gpu-4 have only one GPU each


And my slurm.conf was this:
[...]
AccountingStorageTRES=gres/gpu
GresTypes=gpu
NodeName=node-gpu-1 CPUs=24 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=2 
RealMemory=96000 TmpDisk=47000 Gres=gpu:GeForceRTX2070:1,gpu:GeForceGTX1080Ti:1
NodeName=node-gpu-2 CPUs=24 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=2 
RealMemory=96000 TmpDisk=47000 Gres=gpu:GeForceGTX1080Ti:1,gpu:GeForceGTX1080:1
NodeName=node-gpu-3 CPUs=12 SocketsPerBoard=1 CoresPerSocket=6 ThreadsPerCore=2 
RealMemory=23000 Gres=gpu:GeForceRTX3080:1
NodeName=node-gpu-4 CPUs=8 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=2 
RealMemory=7800 Gres=gpu:GeForceRTX3080:1
NodeName=node-worker-[0-22] CPUs=12 SocketsPerBoard=1 CoresPerSocket=6 
ThreadsPerCore=2 RealMemory=47000
[...]

With this configuration, all seems works fine, except slurmctld.log reports:
[...]
error: _node_config_validate: gres/gpu: invalid GRES core specification (0-11) 
on node node-gpu-3
error: _node_config_validate: gres/gpu: invalid GRES core specification (12-23) 
on node node-gpu-1
error: _node_config_validate: gres/gpu: invalid GRES core specification (12-23) 
on node node-gpu-2
error: _node_config_validate: gres/gpu: invalid GRES core specification (0-7) 
on node node-gpu-4
[...]

However, even these errors, users can submit jobs and request GPUs resources.



Now, I have tried to reconfigure gres.conf and slurmd.conf in this way:
gres.conf:
Name=gpu Type=GeForceRTX2070 File=/dev/nvidia0
Name=gpu Type=GeForceGTX1080Ti File=/dev/nvidia1
Name=gpu Type=GeForceGTX1080Ti File=/dev/nvidia0
Name=gpu Type=GeForceGTX1080 File=/dev/nvidia1
Name=gpu Type=GeForceRTX3080 File=/dev/nvidia0
Name=gpu Type=GeForceRTX3080 File=/dev/nvidia0
# there is no NodeName attribute

slurm.conf:
[...]
NodeName=node-gpu-1 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=2 
RealMemory=96000 TmpDisk=47000 Gres=gpu:GeForceRTX2070:1,gpu:GeForceGTX1080Ti:1
NodeName=node-gpu-2 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=2 
RealMemory=96000 TmpDisk=47000 Gres=gpu:GeForceGTX1080Ti:1,gpu:GeForceGTX1080:1
NodeName=node-gpu-3 SocketsPerBoard=1 CoresPerSocket=6 ThreadsPerCore=2 
RealMemory=23000 Gres=gpu:GeForceRTX3080:1
NodeName=node-gpu-4 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=2 
RealMemory=7800 Gres=gpu:GeForceRTX3080:1
NodeName=node-worker-[0-22] SocketsPerBoard=1 CoresPerSocket=6 ThreadsPerCore=2 
RealMemory=47000
# there is no CPUs attribute
[...]


With this new configuration, nodes with a GPU start the slurmd.service daemon 
correctly, but nodes without a GPU (node-worker-[0-22]) can't start the 
slurmd.service daemon and return this error:
[...]
error: Waiting for gres.conf file /dev/nvidia0
fatal: can't stat gres.conf file /dev/nvidia0: No such file or directory
[...]

It seems Slurm expects the "node-workers" to also have an NVIDIA GPU, but 
these nodes have no GPU... So, where is my configuration error?

I have read about the syntax and examples at https://slurm.schedmd.com/gres.conf.html, 
but it seems I'm doing something wrong.

Thanks!!


[slurm-users] Running slurm on alternate ports

2024-05-20 Thread Alan Stange via slurm-users
Hello all,

For testing purposes, we would like to run Slurm on ports different from
the default values. No problems setting this up, but how does one
tell srun/sbatch/etc. what the different port numbers are? I see no
command line option to specify a port or an alternate configuration file.


Thank you,

Alan


[slurm-users] Re: Running slurm on alternate ports

2024-05-20 Thread Groner, Rob via slurm-users
They get them from the slurm.conf file, so wherever you are executing 
srun/sbatch/etc., they should have access to the Slurm config files.


[slurm-users] Re: Running slurm on alternate ports

2024-05-20 Thread Groner, Rob via slurm-users
Since you mentioned "an alternate configuration file", look at the bottom of 
the sbatch online docs. They describe a SLURM_CONF environment variable you can 
set that points to the config files.
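
For example (the path is a placeholder), something along these lines should make the client commands pick up a test slurm.conf that defines the alternate SlurmctldPort/SlurmdPort:

export SLURM_CONF=/opt/slurm-test/etc/slurm.conf
sbatch job.sh
squeue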

Rob


[slurm-users] Slurm not allocating correct cgroup cpu ids in srun step (possible bug)

2024-05-20 Thread Ashley Wright via slurm-users
Hi,

At our site we have recently upgraded to Slurm 23.11.5 and are having trouble 
with MPI jobs that do an srun inside an sbatch'ed script.

The cgroup does not appear to be set up correctly for the srun step (step_0).

As an example
$ cat /sys/fs/cgroup/cpuset/slurm/uid_11000/job/cpuset.cpus
0,2-3,68-69,96,98-99,164-165
$ cat /sys/fs/cgroup/cpuset/slurm/uid_11000/job/step_0/cpuset.cpus
0,2,68,96,98,164

The sbatch step is allocated a range of CPUs in its cgroup. However, when step_0 is 
run, only some of those CPUs are in the group.
I have noticed that it is always the rest of a range that is missing, i.e. for 2-5 
only 2 is included and 3, 4, 5 are missing.
This also only happens if there are multiple groups of CPUs in the allocation, 
i.e. 1-12 alone would be fine, whereas 1-12,15-20 would result in only 1,15.

The sbatch also seems fine, with step_batch and step_extern being allocated 
correctly.

This causes numerous issues with MPI jobs as they end up overloading cpus.


We are running our nodes with threading enabled on the CPUs, and with cgroups 
and affinity plugins.

I have attached our slurm.conf to show our settings.

Our /etc/slurm/cgroup.conf is
ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes


We have turned on logging at debug2 level, but I haven't yet found anything 
useful. Happy for a suggestion on what to look for.


Is anyone able to provide any advice on where to go next to try and identify 
the issue?

Regards,
Ashley Wright


[Attachment: slurm.conf]

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com