[slurm-users] Could not find group with gid even when they exist

2024-02-06 Thread Nic Lewis via slurm-users
After upgrading to version 23.11.3 we started to get slammed with the following log message from slurmctld: "error: validate_group: Could not find group with gid ". This spans a handful of groups and repeats constantly, drowning out just about everything else. Attempting to do a lookup on the gr

[slurm-users] Is there a way to list allocated/unallocated resources defined in a QoS?

2024-02-06 Thread Alastair Neil via slurm-users
Slurm version 23.02.07. If I have a QoS defined that has a set number of, say, GPU devices set in GrpTRES, is there an easy way to generate a list of how much of the defined quota is allocated or, conversely, unallocated? e.g.: Name|Priority|GraceTime|Preempt|PreemptExemptTime|PreemptMode|Flags|

[slurm-users] scheme for protected GPU jobs from preemption

2024-02-06 Thread Paul Raines via slurm-users
After using just Fairshare for over a year on our GPU cluster, we have decided it is not working for us for what we really want to achieve among our groups. We have decided to look at preemption. What we want is for users to NOT have a #job/GPU maximum (if they are the only person on the cluster t

[slurm-users] Re: Restricting local disk storage of jobs

2024-02-06 Thread Jeffrey T Frey via slurm-users
Most of my ideas have revolved around creating file systems on-the-fly as part of the job prolog and destroying them in the epilog. The issue with that mechanism is that formatting a file system (e.g. mkfs.) can be time-consuming. E.g. formatting your local scratch SSD as an LVM PV+VG and all

[slurm-users] Re: [ext] Restricting local disk storage of jobs

2024-02-06 Thread Tim Schneider via slurm-users
Hi Magnus, I understand. Thanks a lot for your suggestion. Best, Tim On 06.02.24 15:34, Hagdorn, Magnus Karl Moritz wrote: Hi Tim, in the end the InitScript didn't contain anything useful because slurmd: error: _parse_next_key: Parsing error at unrecognized key: InitScript At this stage I g

[slurm-users] Re: [ext] Restricting local disk storage of jobs

2024-02-06 Thread Hagdorn, Magnus Karl Moritz via slurm-users
Hi Tim, in the end the InitScript didn't contain anything useful because slurmd: error: _parse_next_key: Parsing error at unrecognized key: InitScript At this stage I gave up. This was with SLURM 23.02. My plan was to set up the local scratch directory with XFS and then get the script to apply a

[slurm-users] Re: [ext] Restricting local disk storage of jobs

2024-02-06 Thread Tim Schneider via slurm-users
Hi Magnus, thanks for your reply! If you can, would you mind sharing the InitScript of your attempt at getting it to work? Best, Tim On 06.02.24 15:19, Hagdorn, Magnus Karl Moritz wrote: Hi Tim, we are using the container/tmpfs plugin to map /tmp to a local NVMe drive which works great. I d

[slurm-users] Re: [ext] Restricting local disk storage of jobs

2024-02-06 Thread Hagdorn, Magnus Karl Moritz via slurm-users
Hi Tim, we are using the container/tmpfs plugin to map /tmp to a local NVMe drive, which works great. I did consider setting up directory quotas. I thought the InitScript [1] option should do the trick. Alas, I didn't get it to work. If I remember correctly, Slurm complained about the option being p

[slurm-users] Restricting local disk storage of jobs

2024-02-06 Thread Tim Schneider via slurm-users
Hi, In our SLURM cluster, we are using the job_container/tmpfs plugin to ensure that each user can use /tmp and it gets cleaned up after them. Currently, we are mapping /tmp into the node's RAM, which means that the cgroups make sure that users can only use a certain amount of storage inside /

[slurm-users] Re: Starting a job after a file is created in previous job (dependency looking for soluton)

2024-02-06 Thread Loris Bennett via slurm-users
Hi Amjad, Amjad Syed via slurm-users writes: > Hello > > I have the following scenario: > I need to submit a sequence of up to 400 jobs where the even jobs depend on > the preceding odd job to finish and every odd job depends on the presence of > a > file generated by the preceding even job (a

[slurm-users] Re: Starting a job after a file is created in previous job (dependency looking for soluton)

2024-02-06 Thread Bjørn-Helge Mevik via slurm-users
Amjad Syed via slurm-users writes: > I need to submit a sequence of up to 400 jobs where the even jobs depend on > the preceding odd job to finish and every odd job depends on the presence > of a file generated by the preceding even job (availability of the file for > the first of those 400 jobs

[slurm-users] Starting a job after a file is created in previous job (dependency looking for soluton)

2024-02-06 Thread Amjad Syed via slurm-users
Hello I have the following scenario: I need to submit a sequence of up to 400 jobs where the even jobs depend on the preceding odd job to finish and every odd job depends on the presence of a file generated by the preceding even job (availability of the file for the first of those 400 jobs is gua