[slurm-users] cgroup issue on non-systemd system

2024-06-14 Thread Rafał Lalik via slurm-users
Hello, per the documentation, it is possible to run Slurm on a non-systemd system with IgnoreSystemd=yes in cgroup.conf. However, I got an error from slurmd: error: common_file_write_content: unable to open '/sys/fs/cgroup/system.slice/cgroup.subtree_control' for writing: No such file or directory
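For reference, a minimal cgroup.conf for a non-systemd host might look like the sketch below; IgnoreSystemd is the documented option mentioned above, while the remaining lines are assumptions added purely for illustration:

    CgroupPlugin=cgroup/v2
    IgnoreSystemd=yes
    ConstrainCores=yes
    ConstrainRAMSpace=yes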

[slurm-users] Re: Node (anti?) Feature / attribute

2024-06-14 Thread Ryan Cox via slurm-users
We did something like this in the past, but from C. However, modifying the features was painful if the user used any interesting syntax. What we are doing now is using --extra for that purpose. The nodes boot up with SLURMD_OPTIONS="--extra {\\\"os\\\":\\\"rhel9\\\"}" or similar. Users can re
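A sketch of how the --extra route can be wired up; the extra_constraints scheduler parameter and the exact --extra query syntax used here are assumptions, so check the slurm.conf and sbatch man pages for your Slurm version:

    # sysconfig for slurmd, as in the message above
    SLURMD_OPTIONS="--extra {\"os\":\"rhel9\"}"
    # slurm.conf: let jobs filter nodes by their --extra data (Slurm 23.11+)
    SchedulerParameters=extra_constraints
    # job submission that should only land on the rhel9 nodes
    sbatch --extra="os=rhel9" job.sh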

[slurm-users] Re: Node (anti?) Feature / attribute

2024-06-14 Thread Laura Hild via slurm-users
I wrote a job_submit.lua also. It would append "&centos79" to the feature string unless the features already contained "el9", or, if the feature string was empty, set it to "centos79" without the ampersand. I didn't hear from any users doing anything fancy enough with their feature string for the ampersa
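A minimal job_submit.lua sketch of the logic described above, reconstructed from the description, so treat the feature names and the "el9" check as assumptions:

    -- append "&centos79" unless the job already asks for el9;
    -- if the job has no features at all, just set "centos79"
    function slurm_job_submit(job_desc, part_list, submit_uid)
       if job_desc.features == nil or job_desc.features == "" then
          job_desc.features = "centos79"
       elseif not string.find(job_desc.features, "el9") then
          job_desc.features = job_desc.features .. "&centos79"
       end
       return slurm.SUCCESS
    end

    function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
       return slurm.SUCCESS
    end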

[slurm-users] Re: Node (anti?) Feature / attribute

2024-06-14 Thread Bill via slurm-users
We've done this too, though with job_submit.lua, mostly with OS updates. We add a feature to everything, then proceed, telling users that adding a feature gets you onto the "new" nodes. I can send you the snippet if you're using the job_submit.lua script. Bill On 6/14/24 2:18 PM, David Magda via s

[slurm-users] Node (anti?) Feature / attribute

2024-06-14 Thread David Magda via slurm-users
Hello, What I’m looking for is a way for a node to continue to be in the same partition, and have the same QoS(es), but only be chosen if a particular capability is being asked for. This is because we are rolling something (OS upgrade) out slowly to a small batch of nodes at first, and then mor
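For context, the plain feature/constraint mechanism looks like the sketch below (node and feature names are made up). Note that a bare feature does not keep unconstrained jobs off those nodes, which is why the replies above reach for job_submit.lua or --extra:

    # slurm.conf: tag the upgraded nodes
    NodeName=node[01-04] Features=el9 ...
    # users opt in to the upgraded nodes explicitly
    sbatch --constraint=el9 job.sh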

[slurm-users] Re: Debian RPM build for arm64?

2024-06-14 Thread Christopher Harrop - NOAA Affiliate via slurm-users
I have confirmed that the issue is Ubuntu 20.04. I used the tmate GitHub Action to get access to the Ubuntu 20.04 GitHub ARM runner and tried the steps manually one by one. It did indeed fail, almost immediately, in the "debuild -b -uc -us" step. Given that the same experiment done on an Ubuntu

[slurm-users] Re: Issue with starting slurmctld

2024-06-14 Thread Timo Rothenpieler via slurm-users
On 14.06.2024 17:51, Rafał Lalik via slurm-users wrote: Hello, I have encountered issues with running slurmctld. From logs, I see these errors: [2024-06-14T17:37:57.587] slurmctld version 24.05.0 started on cluster laura [2024-06-14T17:37:57.587] error: plugin_load_from_file: dlopen(/usr/li

[slurm-users] Issue with starting slurmctld

2024-06-14 Thread Rafał Lalik via slurm-users
Hello, I have encountered issues with running slurmctld. From logs, I see these errors: [2024-06-14T17:37:57.587] slurmctld version 24.05.0 started on cluster laura [2024-06-14T17:37:57.587] error: plugin_load_from_file: dlopen(/usr/lib64/slurm/jobacct_gather_cgroup.so): /usr/lib64/slurm/jobac

[slurm-users] Re: Debian RPM build for arm64?

2024-06-14 Thread Christopher Harrop via slurm-users
The commands were grouped like that because they are part of a RUN in a Dockerfile. The build was happening on a GitHub Actions runner, so it was not easy to just run them interactively one at a time. But I'm pretty confident that it was the "debuild -b -uc -us" that failed. I have since gathere
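For reference, the grouped steps roughly follow Slurm's documented Debian packaging flow; a sketch, with the version number and package list being assumptions:

    apt-get update && apt-get install -y build-essential fakeroot devscripts equivs
    tar xaf slurm-24.05.0.tar.bz2 && cd slurm-24.05.0
    mk-build-deps -i debian/control
    debuild -b -uc -us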

[slurm-users] Re: slurmstepd: error: task_g_set_affinity: Operation not permitted

2024-06-14 Thread Christopher Harrop via slurm-users
I believe I have solved this. I changed the configuration to replace: TaskPlugin=task/affinity with: TaskPlugin=task/none In my case, the login node, the head node, and all of the compute nodes are running in their own containers. And docker compose is used to run all of those containers to
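In slurm.conf terms, the change described above is simply (sketch):

    # before: affinity calls fail with "Operation not permitted" inside these containers
    #TaskPlugin=task/affinity
    # after:
    TaskPlugin=task/none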

[slurm-users] Re: Limit GPU depending on type

2024-06-14 Thread Gestió Servidors via slurm-users
Hi, because of my real scenario (in my first post I explained my testing scenario), with several different users of different types (researchers, university students and/or teachers, etc.), I have distributed my GPUs into 3 different partitions: * PartitionName=cuda-staff.q Nodes=gpu-[1-4]
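A sketch of what such a per-group split can look like; the group names, node ranges, and the second and third partitions are assumptions, not the poster's actual configuration, which is truncated above:

    PartitionName=cuda-staff.q   Nodes=gpu-[1-4] AllowGroups=staff    Default=NO
    PartitionName=cuda-teacher.q Nodes=gpu-[1-2] AllowGroups=teachers Default=NO
    PartitionName=cuda-student.q Nodes=gpu-[3-4] AllowGroups=students Default=NO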