[slurm-users] Re: memory high water mark reporting

2024-05-20 Thread Emyr James via slurm-users
Looking here: https://slurm.schedmd.com/spank.html#SECTION_SPANK-PLUGINS it looks like it's possible to hook something in at the right place using the slurm_spank_task_exit or slurm_spank_exit callbacks. Does anyone have any experience or examples of doing this? Is there any more documentation
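
A minimal sketch of such a plugin, for illustration only: the cgroup path format below is a site-dependent assumption, it presumes cgroup v2 with a kernel that exposes memory.peak (Linux 5.19+), and it is untested.

    /* peakmem_spank.c - hedged sketch of a SPANK plugin that logs the
     * cgroup v2 memory high-water mark as each task exits
     * (runs in the slurmstepd "remote" context). */
    #include <stdio.h>
    #include <stdint.h>
    #include <slurm/spank.h>

    SPANK_PLUGIN(peakmem, 1);

    int slurm_spank_task_exit(spank_t sp, int ac, char **av)
    {
        uint32_t jobid = 0, stepid = 0;
        int taskid = 0;
        char path[4096], buf[64];
        FILE *fp;

        if (spank_get_item(sp, S_JOB_ID, &jobid) != ESPANK_SUCCESS ||
            spank_get_item(sp, S_JOB_STEPID, &stepid) != ESPANK_SUCCESS ||
            spank_get_item(sp, S_TASK_ID, &taskid) != ESPANK_SUCCESS)
            return ESPANK_SUCCESS;

        /* Assumed layout: verify against your slurmd's actual cgroup tree. */
        snprintf(path, sizeof(path),
                 "/sys/fs/cgroup/system.slice/slurmstepd.scope/"
                 "job_%u/step_%u/user/task_%d/memory.peak",
                 jobid, stepid, taskid);

        if ((fp = fopen(path, "r")) && fgets(buf, sizeof(buf), fp))
            slurm_info("peakmem: job %u step %u task %d peak=%s",
                       jobid, stepid, taskid, buf);
        if (fp)
            fclose(fp);
        return ESPANK_SUCCESS;
    }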

[slurm-users] Re: memory high water mark reporting

2024-05-20 Thread greent10--- via slurm-users
Hi, We have had similar questions from users about how best to find the peak memory usage of a job, since they may run a job and get a not-very-useful value for sacct fields such as MaxRSS: Slurm may simply not have polled at the moment of maximum memory usage. With cgroup v1, looking
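
Worth noting: the kernel already tracks a high-water mark directly; under cgroup v1 it is memory.max_usage_in_bytes, and under cgroup v2 it is memory.peak (Linux 5.19+). A quick manual check from a node, with placeholder uid/job IDs and a path layout that varies by site:

    # cgroup v1
    cat /sys/fs/cgroup/memory/slurm/uid_1000/job_1234/memory.max_usage_in_bytes
    # cgroup v2
    cat /sys/fs/cgroup/system.slice/slurmstepd.scope/job_1234/memory.peak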

[slurm-users] Re: memory high water mark reporting

2024-05-20 Thread Emyr James via slurm-users
Siwmae Thomas, I grepped for memory.peak in the source and it's not there. memory.current is there and is used in src/plugins/cgroup/v2/cgroup_v2.c. Adding the ability to read memory.peak in this source file seems like something that should be done. Should extern cgroup_acct_t *cgroup_p_task_ge

[slurm-users] Re: memory high water mark reporting

2024-05-20 Thread Emyr James via slurm-users
A bit more digging: the cgroup code seems to communicate the values it gathers back in src/plugins/jobacct_gather/cgroup/jobacct_gather_cgroup.c:

    prec->tres_data[TRES_ARRAY_MEM].size_read = cgroup_acct_data->total_rss;

I can't find anywhere in the code where it seems
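
Purely as an illustration of where a high-water mark could slot in (peak_rss is a hypothetical field that does not exist in the Slurm source):

    /* hypothetical: feed a cgroup-reported peak into accounting so
     * MaxRSS reflects the true maximum rather than the last poll */
    if (cgroup_acct_data->peak_rss > prec->tres_data[TRES_ARRAY_MEM].size_read)
            prec->tres_data[TRES_ARRAY_MEM].size_read =
                    cgroup_acct_data->peak_rss;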

[slurm-users] Re: memory high water mark reporting

2024-05-20 Thread Emyr James via slurm-users
I changed the following in src/plugins/cgroup/v2/cgroup_v2.c:

    if (common_cgroup_get_param(&task_cg_info->task_cg, "memory.current",
                                &memory_current, &tmp_sz) != SLURM_SUCCESS) {
            if (task_id == task_special_
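
Following the same pattern as the memory.current read above, a hedged sketch of how a memory.peak read might look in that function (untested; assumes the same common_cgroup_get_param helper and cgroup v2 on Linux 5.19+):

    char *memory_peak = NULL;
    size_t peak_sz = 0;

    if (common_cgroup_get_param(&task_cg_info->task_cg, "memory.peak",
                                &memory_peak, &peak_sz) == SLURM_SUCCESS) {
            /* memory.peak holds the cgroup's high-water mark in bytes */
            uint64_t peak = strtoull(memory_peak, NULL, 10);
            /* ... store peak in the task's accounting data here ... */
            xfree(memory_peak);
    }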

[slurm-users] Re: memory high water mark reporting

2024-05-20 Thread Ryan Cox via slurm-users
We have a pretty ugly patch that calls out to a script from common_cgroup_delete() in src/plugins/cgroup/common/cgroup_common.c.  It checks that it's the job cgroup being deleted ("/job_*" as the path).  The script collects the data and stores it elsewhere. It's a really ugly way of doing it a
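
A rough sketch of the kind of hook described (not the actual patch; the collector script path is a placeholder):

    /* inside common_cgroup_delete(), before removing the directory:
     * call out to a site script when a job-level cgroup is deleted */
    if (strstr(cg->path, "/job_")) {
            char cmd[4096];
            snprintf(cmd, sizeof(cmd),
                     "/usr/local/sbin/collect_peakmem.sh '%s'", cg->path);
            if (system(cmd) == -1)
                    error("peakmem: failed to run collector for %s", cg->path);
    }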

[slurm-users] Apply a specific QoS to all users that belong to a specific account

2024-05-20 Thread Gestió Servidors via slurm-users
Hi, I would like to know if it is possible to apply a specific QoS to all users that belong to a specific account. For example, I have created some new users "user_XX" and I have also created their new accounts in Slurm with "sacctmgr create account name=Test" and "sacctmgr create user nam
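
One common approach (assuming the QoS already exists, or creating it first) is to attach the QoS to the account's association; users created under that account then inherit it unless overridden at the user level. For example, with a placeholder QoS name:

    sacctmgr add qos test_qos
    sacctmgr modify account name=Test set QOS=test_qos DefaultQOS=test_qos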

[slurm-users] Re: memory high water mark reporting

2024-05-20 Thread greent10--- via slurm-users
Hi, I came to same conclusion and spotted similar bits of the code where code could be changed to get what was required. Without a new variable it will be tricky to implement properly due to way those existing variables are used and defined. Maybe a PeakMem variable in Slurm accounting databa

[slurm-users] Invalid/incorrect gres.conf syntax

2024-05-20 Thread Gestió Servidors via slurm-users
Hello, I have configured my "gres.conf" in this way:

    NodeName=node-gpu-1 AutoDetect=off Name=gpu Type=GeForceRTX2070 File=/dev/nvidia0 Cores=0-11
    NodeName=node-gpu-1 AutoDetect=off Name=gpu Type=GeForceGTX1080Ti File=/dev/nvidia1 Cores=12-23
    NodeName=node-gpu-2 AutoDetect=off Name=gpu Type=GeFor
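
For reference, each gres.conf entry also needs a matching Gres= declaration on the node's line in slurm.conf, or slurmd will reject the configuration. A minimal sketch mirroring the names above (counts assumed, other node attributes omitted):

    # slurm.conf
    GresTypes=gpu
    NodeName=node-gpu-1 Gres=gpu:GeForceRTX2070:1,gpu:GeForceGTX1080Ti:1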

[slurm-users] Problems with gres.conf

2024-05-20 Thread Gestió Servidors via slurm-users
Hello, I am trying to rewrite my gres.conf file. Before changes, this file was just like this:

    NodeName=node-gpu-1 AutoDetect=off Name=gpu Type=GeForceRTX2070 File=/dev/nvidia0 Cores=0-11
    NodeName=node-gpu-1 AutoDetect=off Name=gpu Type=GeForceGTX1080Ti File=/dev/nvidia1 Cores=12-23
    NodeName=no
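
If the rewrite is mainly to condense lines, gres.conf also accepts device ranges in File= when consecutive devices share a type; an illustrative line for a hypothetical node:

    NodeName=node-gpu-3 AutoDetect=off Name=gpu Type=GeForceGTX1080Ti File=/dev/nvidia[0-1] Cores=0-23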

[slurm-users] Running slurm on alternate ports

2024-05-20 Thread Alan Stange via slurm-users
Hello all, for testing purposes we would like to run Slurm on ports different from the default values. No problems in setting this up. But how does one tell srun/sbatch/etc what the different port numbers are? I see no command line options to specify a port or an alternate configuration file

[slurm-users] Re: Running slurm on alternate ports

2024-05-20 Thread Groner, Rob via slurm-users
It gets them from the slurm.conf file. So wherever you are executing srun/sbatch/etc, it should have access to the Slurm config files.
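
For example (port numbers are placeholders), the test instance's slurm.conf might set:

    SlurmctldPort=7817
    SlurmdPort=7818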

[slurm-users] Re: Running slurm on alternate ports

2024-05-20 Thread Groner, Rob via slurm-users
Since you mentioned "an alternate configuration file", look at the bottom of the sbatch online docs. They describe a SLURM_CONF env var you can set that points to the config files. Rob
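
For example (the config path is a placeholder):

    export SLURM_CONF=/opt/slurm-test/etc/slurm.conf
    sbatch job.sh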

[slurm-users] Slurm not allocating correct cgroup cpu ids in srun step (possible bug)

2024-05-20 Thread Ashley Wright via slurm-users
Hi, At our site we have recently upgraded to Slurm 23.11.5 and are having trouble with MPI jobs doing srun inside an sbatch'ed script. The cgroup does not appear to be set up correctly for the srun (step_0). As an example:

    $ cat /sys/fs/cgroup/cpuset/slurm/uid_11000/job/cpuset.cpus
    0,2-3,6
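
A quick way to compare what the batch step and the srun step actually received, following the cgroup v1 layout above (job/step IDs are placeholders):

    cat /sys/fs/cgroup/cpuset/slurm/uid_11000/job_1234/cpuset.cpus
    cat /sys/fs/cgroup/cpuset/slurm/uid_11000/job_1234/step_batch/cpuset.cpus
    cat /sys/fs/cgroup/cpuset/slurm/uid_11000/job_1234/step_0/cpuset.cpus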