[slurm-users] Slurm-power-management

2023-06-17 Thread Werf, C.G. van der (Carel)
I have an HPC cluster with 9 compute nodes controlled by Slurm; all of them are 
identical and in the same partition.

I have set up Slurm's Power Saving Mode, which works quite well.
A node is shut down after being idle for 30 minutes and is resumed on 
demand.

The load on the cluster as a whole is not that high during some periods, and a 
side effect is that NODE01 is always restarted first, while NODE08 is almost 
never restarted.
(I started recording the number of resume processes about 8 months ago: 
node01 was restarted 7 times as often as node08, node02 5 times as often, etc.)

I am looking for a method to somehow "randomize" the resume schedule. 

Does anyone have an idea how to achieve this?


With Regards,

| Carel van der Werf | 
| Developer/Administrator Linux | ICT-Bèta | Department of Science | 
Utrecht University |
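
One possible direction, offered as a sketch and not as something proposed in the thread: when nodes are otherwise equal, Slurm prefers the nodes with the lowest Weight, so reshuffling the weights from time to time (for example from cron) might spread the resumes more evenly. The code below only builds the scontrol commands and does not run them. The node names node01 to node09 and the function name are assumptions for illustration. Weights set through scontrol may be reset to the slurm.conf values when slurmctld restarts or is reconfigured, so this would need checking on a real cluster.

```python
import random


def shuffled_weight_commands(nodes, rng=None):
    """Build one 'scontrol update' command per node with a shuffled Weight.

    Weights 1..N are handed out in random order, so a different node
    ends up with the lowest weight each time this is called.
    """
    if rng is None:
        rng = random.Random()
    order = list(nodes)
    rng.shuffle(order)
    return [
        f"scontrol update NodeName={name} Weight={weight}"
        for weight, name in enumerate(order, start=1)
    ]


if __name__ == "__main__":
    # Hypothetical node names; replace with the names from slurm.conf.
    nodes = [f"node{i:02d}" for i in range(1, 10)]
    for command in shuffled_weight_commands(nodes):
        print(command)
```

The printed commands can be reviewed first and then executed by hand or from a wrapper script.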






Re: [slurm-users] task/cgroup plugin causes "srun: error: task 0 launch failed: Plugin initialization failed" error on Ubuntu 22.04

2023-06-17 Thread Tim Schneider

Hi,

I just want to wrap this up in case someone has the same issue in the 
future.


As Reed pointed out, Ubuntu 22.04 uses cgroups v2 by default. At 
the same time, the slurm-wlm package in the Ubuntu repositories uses 
cgroups v1, which makes its task/cgroup plugin incompatible with the 
default Ubuntu 22.04 setup.


My solution was to build Slurm 22.05 manually, while ensuring that 
libdbus-1-dev is installed (as otherwise cgroups v2 support does not 
get built). This takes a bit more time but seems to work so far.


Thanks a lot Reed & Abel for your advice!

Best,

Tim

On 6/16/23 10:42, Tim Schneider wrote:


Hi again,

I just realized that the author of 
https://groups.google.com/g/slurm-users/c/0dJhe5r6_2Q?pli=1 wrote at 
some point that he built Slurm 22 instead of using the Ubuntu repo 
version. So I guess I will have to look into that.


Best,

Tim

On 6/16/23 10:36, Tim Schneider wrote:


Hi Abel and Reed,

thanks a lot for your quick replies!

I did indeed just install slurm-wlm from the Ubuntu repos.

Following the advice of 
https://groups.google.com/g/slurm-users/c/0dJhe5r6_2Q?pli=1, I tried 
disabling cgroups v1 on Ubuntu, but that just leads to an error 
during startup of slurmd:


slurmd: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/proctrack_cgroup.so
slurmd: error: unable to mount freezer cgroup namespace: Invalid argument
slurmd: error: unable to create freezer cgroup namespace
slurmd: error: Couldn't load specified plugin name for proctrack/cgroup: Plugin init() callback failed
slurmd: error: cannot create proctrack context for proctrack/cgroup
slurmd: error: slurmd initialization failed

So it seems that slurmd is using cgroups v1. This is also reflected 
in the mounts (for the output below, cgroups v1 is enabled again):


$ mount | grep cgroup
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
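
As a rough aid for reading mount output like the above, here is a minimal Python sketch that sorts cgroup mounts by filesystem type (cgroup for v1, cgroup2 for v2). The function names are made up for illustration, and the parser assumes the usual "<source> on <mountpoint> type <fstype> (<options>)" line shape with no spaces in the mount point.

```python
def cgroup_mounts(mount_output):
    """Group cgroup mount points by filesystem type from `mount` output."""
    found = {"cgroup": [], "cgroup2": []}
    for line in mount_output.splitlines():
        parts = line.split()
        # Expected shape: <source> on <mountpoint> type <fstype> (<options>)
        if len(parts) >= 5 and parts[1] == "on" and parts[3] == "type":
            if parts[4] in found:
                found[parts[4]].append(parts[2])
    return found


def describe(found):
    """Summarise which cgroup versions are mounted."""
    if found["cgroup2"] and found["cgroup"]:
        return "mixed: v2 hierarchy plus v1 controller mounts"
    if found["cgroup2"]:
        return "v2 only"
    if found["cgroup"]:
        return "v1 only"
    return "no cgroup mounts found"


if __name__ == "__main__":
    sample = (
        "cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime)\n"
        "cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)\n"
    )
    print(describe(cgroup_mounts(sample)))
```

Fed the two mount lines shown above, this reports a mixed setup: the v2 hierarchy at /sys/fs/cgroup plus a v1 freezer controller mount.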


What is still confusing to me is that the slurmd logs indicate no 
error when I try running with cgroups v1 enabled and the error only 
appears on the slurmctld side.


Do you know how I can enable cgroups v2 in Slurm? To me it seems that 
this is what the author of 
https://groups.google.com/g/slurm-users/c/0dJhe5r6_2Q?pli=1 did.


Best,

Tim

On 6/16/23 03:28, abel pinto wrote:
Indeed, the issue seems to be that Ubuntu 22.04 does not support 
cgroups v1 anymore. Does SLURM support cgroupsv2? It seems so: 
https://slurm.schedmd.com/cgroup_v2.html


/Abel


On Jun 15, 2023, at 20:20, Reed Dier wrote:

I don’t have any direct advice off-hand, but I figure I will try 
to help steer the conversation in the right direction for figuring 
it out.


I’m going to assume that, since you mention 21.08.5, you are using 
the slurm-wlm packages from the Ubuntu repos and not building 
Slurm yourself?


And have all the components (slurmctld(s), slurmdbd, slurmd(s)) 
been upgraded as well?


The only thing that immediately comes to mind is that I remember 
reading a good bit about Ubuntu 22.04’s use of cgroups v2, which as 
I understand it are very different from cgroups v1, and plenty of 
people have had issues with v1/v2 mismatches with slurm and other 
applications.


https://www.reddit.com/r/SLURM/comments/vjquih/error_cannot_find_cgroup_plugin_for_cgroupv2/
https://groups.google.com/g/slurm-users/c/0dJhe5r6_2Q?pli=1
https://discuss.linuxcontainers.org/t/after-updated-to-more-recent-ubuntu-version-with-cgroups-v2-ubuntu-16-04-container-is-not-working-properly/14022

Hope that at least steers the conversation in a good direction.

Reed

On Jun 15, 2023, at 5:04 PM, Tim Schneider wrote:


Hi,

I maintain the SLURM cluster of my research group. Recently 
I updated to Ubuntu 22.04 and Slurm 21.08.5, and ever since I have 
been unable to launch jobs. When launching a job, I receive the 
following error:


$ srun --nodes=1 --ntasks-per-node=1 -c 1 --mem-per-cpu 1G --time=01:00:00 --pty -p amd -w cn02 --pty bash -i
srun: error: task 0 launch failed: Plugin initialization failed

Strangely, I cannot find any indication of this problem in the 
logs (find the logs attached). The problem must be related to the 
task/cgroup plugin, as it does not occur when I disable it.


After reading the documentation, I tried adding the 
cgroup_enable=memory swapaccount=1 kernel parameters, but the 
problem persisted.


I would be very grateful for any advice on where to look, since I have 
no idea how to investigate this issue further.


Thanks a lot in advance.

Best,

Tim




