[slurm-users] Starting a job after a file is created in previous job (dependency, looking for solution)

2024-02-06 Thread Amjad Syed via slurm-users
Hello

I have the following scenario:
I need to submit a sequence of up to 400 jobs where the even jobs depend on
the preceding odd job finishing, and every odd job depends on the presence
of a file generated by the preceding even job (availability of the file for
the first of those 400 jobs is guaranteed).

If I just submit all those jobs via a loop using dependencies, then I end
up with a lot of pending jobs that might later not even run because no
output file has been produced by the preceding jobs. Is there a way to
pause the submission loop until the required file has been generated, so
that at most two jobs are submitted at the same time?

Here is a sample submission script showing what I want to achieve.

for i in {1..200}; do
    FILE=GHM_paramset_${i}.dat
    # How can I pause the submission loop until the FILE has been created?
    #if test -f "$FILE"; then
    jobid4=$(sbatch --parsable --dependency=afterok:$jobid3 job4_sub $i)
    jobid3=$(sbatch --parsable --dependency=afterok:$jobid4 job3_sub $i)
    #fi
done


Any help will be appreciated

Amjad

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Starting a job after a file is created in previous job (dependency, looking for solution)

2024-02-06 Thread Bjørn-Helge Mevik via slurm-users
Amjad Syed via slurm-users  writes:

> I need to submit a sequence of up to 400 jobs where the even jobs depend on
> the preceding odd job finishing, and every odd job depends on the presence
> of a file generated by the preceding even job (availability of the file for
> the first of those 400 jobs is guaranteed).

How about letting each even job submit the next odd job after it has
created the file, and also the following even job, with a dependency on
the odd job?  You would obviously have to keep track of how many jobs
you've submitted so you can stop after 400 jobs. :)
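
Something along these lines might do it (an untested sketch; job_even.sh,
job_odd.sh and the way the chain is counted are placeholders, not your
actual scripts):

#!/bin/bash
# job_even.sh -- run even step $1 of the chain, then submit the next pair
i=$1
MAX=400                                # total length of the chain

# ... do the work that produces this step's output file ...
OUTFILE=GHM_paramset_${i}.dat          # placeholder name

next=$((i + 1))
if [ "$next" -le "$MAX" ] && [ -f "$OUTFILE" ]; then
    # the odd job can start right away, since its input file now exists
    odd_id=$(sbatch --parsable job_odd.sh "$next")
    # the following even job waits for that odd job to finish OK
    if [ $((next + 1)) -le "$MAX" ]; then
        sbatch --dependency=afterok:"$odd_id" job_even.sh $((next + 1))
    fi
fi

That way at most two jobs are in the queue at any time.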

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo



-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Starting a job after a file is created in previous job (dependency looking for soluton)

2024-02-06 Thread Loris Bennett via slurm-users
Hi Amjad,

Amjad Syed via slurm-users  writes:

> Hello
>
> I have the following scenario:
> I need to submit a sequence of up to 400 jobs where the even jobs depend on
> the preceding odd job finishing, and every odd job depends on the presence
> of a file generated by the preceding even job (availability of the file for
> the first of those 400 jobs is guaranteed).
>
> If I just submit all those jobs via a loop using dependencies, then I end
> up with a lot of pending jobs that might later not even run because no
> output file has been produced by the preceding jobs. Is there a way to
> pause the submission loop until the required file has been generated, so
> that at most two jobs are submitted at the same time?
>
> Here is a sample submission script showing what I want to achieve.
>
> for i in {1..200}; do
>     FILE=GHM_paramset_${i}.dat
>     # How can I pause the submission loop until the FILE has been created?
>     #if test -f "$FILE"; then
>     jobid4=$(sbatch --parsable --dependency=afterok:$jobid3 job4_sub $i)
>     jobid3=$(sbatch --parsable --dependency=afterok:$jobid4 job3_sub $i)
>     #fi
> done
>
> Any help will be appreciated
>
> Amjad

You might find a job array useful for this (for any large number of jobs
with identical resources, using a job array also helps backfilling work
efficiently, if you are using it).

With a job array you can specify how many jobs should run simultaneously
with the '%' notation:

  --array=1-200%2
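
One way this could look (an untested sketch; the assumption that the two
steps for a given parameter set can run back to back inside one array
task, and the script names, are mine):

#!/bin/bash
#SBATCH --array=1-200%2       # 200 parameter sets, at most 2 tasks at a time
#SBATCH --job-name=GHM_chain

i=${SLURM_ARRAY_TASK_ID}
FILE=GHM_paramset_${i}.dat

# the file this pair needs is checked here rather than at submission time
test -f "$FILE" || exit 1

# run the two steps for parameter set $i back to back
./job4_step "$i" && ./job3_step "$i"

Note that this drops the strict ordering between consecutive parameter
sets, so it only fits if the sets are independent of each other.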

Cheers,

Loris

-- 
Dr. Loris Bennett (Herr/Mr)
FUB-IT (ex-ZEDAT), Freie Universität Berlin

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Restricting local disk storage of jobs

2024-02-06 Thread Tim Schneider via slurm-users

Hi,

In our SLURM cluster, we are using the job_container/tmpfs plugin to 
ensure that each user can use /tmp and it gets cleaned up after them. 
Currently, we are mapping /tmp into the node's RAM, which means that the 
cgroups make sure that users can only use a certain amount of storage 
inside /tmp.


Now we would like to use the node's local SSD instead of its RAM to 
hold the files in /tmp. I have seen people define local storage as GRES, 
but I am wondering how to make sure that users do not exceed the storage 
space they requested in a job. Does anyone have an idea how to configure 
local storage as a proper tracked resource?


Thanks a lot in advance!

Best,

Tim


--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: [ext] Restricting local disk storage of jobs

2024-02-06 Thread Hagdorn, Magnus Karl Moritz via slurm-users
Hi Tim,
we are using the job_container/tmpfs plugin to map /tmp to a local NVMe
drive, which works great. I did consider setting up directory quotas. I
thought the InitScript [1] option should do the trick. Alas, I didn't
get it to work. If I remember correctly, slurm complained about the
option being present. In the end we recommend that our users make
exclusive use of a node if they are going to use a lot of local scratch
space. I don't think this happens very often, if at all.
Regards
magnus

[1] 
https://slurm.schedmd.com/job_container.conf.html#OPT_InitScript


On Tue, 2024-02-06 at 14:39 +0100, Tim Schneider via slurm-users wrote:
> Hi,
> 
> In our SLURM cluster, we are using the job_container/tmpfs plugin to
> ensure that each user can use /tmp and it gets cleaned up after them.
> Currently, we are mapping /tmp into the node's RAM, which means that the
> cgroups make sure that users can only use a certain amount of storage
> inside /tmp.
> 
> Now we would like to use the node's local SSD instead of its RAM to
> hold the files in /tmp. I have seen people define local storage as GRES,
> but I am wondering how to make sure that users do not exceed the storage
> space they requested in a job. Does anyone have an idea how to configure
> local storage as a proper tracked resource?
> 
> Thanks a lot in advance!
> 
> Best,
> 
> Tim
> 
> 

-- 
Magnus Hagdorn
Charité – Universitätsmedizin Berlin
Geschäftsbereich IT | Scientific Computing
 
Campus Charité Mitte
BALTIC - Invalidenstraße 120/121
10115 Berlin
 
magnus.hagd...@charite.de
https://www.charite.de
HPC Helpdesk: sc-hpc-helpd...@charite.de



-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: [ext] Restricting local disk storage of jobs

2024-02-06 Thread Tim Schneider via slurm-users

Hi Magnus,

thanks for your reply! If you can, would you mind sharing the InitScript 
of your attempt at getting it to work?


Best,

Tim

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: [ext] Restricting local disk storage of jobs

2024-02-06 Thread Hagdorn, Magnus Karl Moritz via slurm-users
Hi Tim,
in the end the InitScript didn't contain anything useful because 

slurmd: error: _parse_next_key: Parsing error at unrecognized key:
InitScript

At this stage I gave up. This was with SLURM 23.02. My plan was to
set up the local scratch directory with XFS and then get the script to
apply a project quota, i.e. a quota attached to the directory.

I would start by checking if slurm recognises the InitScript option. 
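
For reference, the kind of job_container.conf I had in mind looks roughly
like this (untested; the paths are placeholders, and InitScript is only
documented for newer Slurm releases, which is presumably why 23.02 rejected
the key):

AutoBasePath=true
BasePath=/local/scratch/slurm_tmpfs
InitScript=/etc/slurm/tmpfs_init.sh

The idea was for the InitScript to run the commands that attach a project
quota to the freshly created per-job directory.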

Regards
magnus

On Tue, 2024-02-06 at 15:24 +0100, Tim Schneider wrote:
> Hi Magnus,
> 
> thanks for your reply! If you can, would you mind sharing the
> InitScript 
> of your attempt at getting it to work?
> 
> Best,
> 
> Tim
> 

-- 
Magnus Hagdorn
Charité – Universitätsmedizin Berlin
Geschäftsbereich IT | Scientific Computing
 
Campus Charité Mitte
BALTIC - Invalidenstraße 120/121
10115 Berlin
 
magnus.hagd...@charite.de
https://www.charite.de
HPC Helpdesk: sc-hpc-helpd...@charite.de



-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: [ext] Restricting local disk storage of jobs

2024-02-06 Thread Tim Schneider via slurm-users

Hi Magnus,

I understand. Thanks a lot for your suggestion.

Best,

Tim

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Restricting local disk storage of jobs

2024-02-06 Thread Jeffrey T Frey via slurm-users
Most of my ideas have revolved around creating file systems on-the-fly as part 
of the job prolog and destroying them in the epilog.  The issue with that 
mechanism is that formatting a file system (e.g. mkfs.<fstype>) can be 
time-consuming.  E.g. even if you format your local scratch SSD as an LVM PV+VG 
and allocate per-job volumes, you'd still need to run e.g. mkfs.xfs and mount 
the new file system.


ZFS file system creation is much quicker (basically combines the LVM + mkfs 
steps above) but I don't know of any clusters using ZFS to manage local file 
systems on the compute nodes :-)
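
For the ZFS case the per-job setup/teardown would be roughly the following
(a hypothetical sketch; "scratch" is a placeholder pool name):

[root@r00n00 /]# zfs create -o quota=10G -o mountpoint=/tmp-alloc/slurm-2147483647 scratch/slurm-2147483647
   :
[root@r00n00 /]# zfs destroy scratch/slurm-2147483647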


One could leverage XFS project quotas.  E.g. for Slurm job 2147483647:


[root@r00n00 /]# mkdir /tmp-alloc/slurm-2147483647
[root@r00n00 /]# xfs_quota -x -c 'project -s -p /tmp-alloc/slurm-2147483647 2147483647' /tmp-alloc
Setting up project 2147483647 (path /tmp-alloc/slurm-2147483647)...
Processed 1 (/etc/projects and cmdline) paths for project 2147483647 with recursion depth infinite (-1).
[root@r00n00 /]# xfs_quota -x -c 'limit -p bhard=1g 2147483647' /tmp-alloc
[root@r00n00 /]# cd /tmp-alloc/slurm-2147483647
[root@r00n00 slurm-2147483647]# dd if=/dev/zero of=zeroes bs=5M count=1000
dd: error writing ‘zeroes’: No space left on device
205+0 records in
204+0 records out
1073741824 bytes (1.1 GB) copied, 2.92232 s, 367 MB/s

   :

[root@r00n00 /]# rm -rf /tmp-alloc/slurm-2147483647
[root@r00n00 /]# xfs_quota -x -c 'limit -p bhard=0 2147483647' /tmp-alloc


Since Slurm job ids max out at 0x03FFFFFF (and 2147483647 = 0x7FFFFFFF), we have 
an easy on-demand project id to use on the file system.  Slurm tmpfs plugins 
have to do a mkdir to create the per-job directory, so adding two xfs_quota 
commands (which run in more or less O(1) time) won't extend the prolog by much. 
Likewise, Slurm tmpfs plugins have to scrub the directory at job cleanup, so 
adding another xfs_quota command will not do much to change their epilog 
execution times.  The main question is "where does the tmpfs plugin find the 
quota limit for the job?"
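
As a rough sketch of what the per-job setup could look like in a prolog (or
a patched tmpfs plugin), with the caveat that reading the limit from the
job's comment field is purely an assumption on my part and not an existing
Slurm mechanism:

#!/bin/bash
# prolog sketch: give the job's scratch directory an XFS project quota
JOBID=${SLURM_JOB_ID}
BASE=/tmp-alloc                    # assumed XFS mount with prjquota enabled
DIR=${BASE}/slurm-${JOBID}

# where the limit comes from is the open question; here we pretend the job
# carries something like "tmp=20g" in its comment field
LIMIT=$(scontrol show job "$JOBID" | grep -oP 'Comment=\S*tmp=\K[0-9]+[kmgt]?' || echo 1g)

mkdir -p "$DIR"
xfs_quota -x -c "project -s -p $DIR $JOBID" "$BASE"
xfs_quota -x -c "limit -p bhard=$LIMIT $JOBID" "$BASE"

The epilog would do the reverse: remove the directory and reset the limit
with bhard=0, as above.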





> On Feb 6, 2024, at 08:39, Tim Schneider via slurm-users wrote:
> 
> Hi,
> 
> In our SLURM cluster, we are using the job_container/tmpfs plugin to ensure 
> that each user can use /tmp and it gets cleaned up after them. Currently, we 
> are mapping /tmp into the node's RAM, which means that the cgroups make sure 
> that users can only use a certain amount of storage inside /tmp.
> 
> Now we would like to use the node's local SSD instead of its RAM to hold 
> the files in /tmp. I have seen people define local storage as GRES, but I am 
> wondering how to make sure that users do not exceed the storage space they 
> requested in a job. Does anyone have an idea how to configure local storage 
> as a proper tracked resource?
> 
> Thanks a lot in advance!
> 
> Best,
> 
> Tim
> 
> 
> -- 
> slurm-users mailing list -- slurm-users@lists.schedmd.com
> To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] scheme for protected GPU jobs from preemption

2024-02-06 Thread Paul Raines via slurm-users



After using just Fairshare for over a year on our GPU cluster, we
have decided it is not giving us what we really want
to achieve among our groups.  We have decided to look at preemption.

What we want is for users to NOT have a #job/GPU maximum (if they are the
only person on the cluster they should be able to use it all), but
if another user comes to the "full" cluster they should immediately
be able to run some jobs.  Thus preemption is needed.

In our scheme we want

* users to have N protected GPU jobs that cannot be preempted
  where N is the number of GPUs allocated.

* N may not be the same for all users.  Some privileged users get more.

* jobs pending in the queue will have lower
  priority dependent on the number of GPUs allocated to running jobs.
  Maybe doable somehow with PriorityWeightJobSize though not sure how.

* Jobs over N are subject to preemption (and requeued if --requeue is
  given), with the shortest-running jobs of the user with the most
  unprotected GPUs preempted first.

* another complication is that we have a variety of different GPUs and
  users may ask for specific ones, which can limit which
  unprotected GPU jobs are available for preemption

My first attempt to do this in SLURM was to just create two partitions, 
GPU and GPU-req, with different PriorityTier values, the latter 
partition having PreemptMode=REQUEUE.  But N would be set by a MaxTRES on 
the first partition and would be the same for everyone, and we need it to be 
INDEPENDENT for each user.


Also, users would have to "think" about which partition to submit jobs to. 
And users want their longest-running "unprotected" job to be 
PROMOTED to a "protected" job automatically when a "protected" job 
finishes.  However, slurm does not allow running jobs to move between 
partitions.


I am trying to figure out QOS preemption, which might solve the 
independent-N-per-user issue, but I don't think it will solve the promotion 
issue.
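
The rough QOS setup I am picturing is something like this (untested; the
QOS names are placeholders, and since MaxTRESPerUser is a per-QOS limit, a
different N per user would still need one protected QOS per value of N):

# slurm.conf
PreemptType=preempt/qos
PreemptMode=REQUEUE

# sacctmgr ("4" stands in for a user's protected GPU count N)
sacctmgr add qos gpu_scavenge
sacctmgr modify qos gpu_scavenge set Priority=10 PreemptMode=requeue
sacctmgr add qos gpu_protected
sacctmgr modify qos gpu_protected set Priority=100 Preempt=gpu_scavenge MaxTRESPerUser=gres/gpu=4

What this still would not give me is the automatic promotion of a running
"unprotected" job to "protected" when a protected slot frees up.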


Any ideas how this scheme might be possible in SLURM?

Otherwise I might have to write a complicated cron job that
tries to do it all "outside" of SLURM by issuing scontrol commands.

---
Paul Raines http://help.nmr.mgh.harvard.edu
MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging
149 (2301) 13th Street, Charlestown, MA 02129 USA






--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Is there a way to list allocated/unallocated resources defined in a QoS?

2024-02-06 Thread Alastair Neil via slurm-users
Slurm version 23.02.07
If I have a QoS defined that has a set number of, say, GPU devices set in
GrpTRES, is there an easy way to generate a list of how much of the
defined quota is allocated, or conversely unallocated?

e.g.:

Name|Priority|GraceTime|Preempt|PreemptExemptTime|PreemptMode|Flags|UsageThres|UsageFactor|GrpTRES|GrpTRESMins|GrpTRESRunMins|GrpJobs|GrpSubmit|GrpWall|MaxTRES|MaxTRESPerNode|MaxTRESMins|MaxWall|MaxTRESPU|MaxJobsPU|MaxSubmitPU|MaxTRESPA|MaxJobsPA|MaxSubmitPA|MinTRES|
normal|0|00:00:00|||cluster|||1.00|||cpu=3000,gres/gpu=20|||
dept1|1|00:00:00|||cluster|||1.00|cpu=256,gres/gpu:1g.10gb=16,gres/gpu:2g.20gb=8,gres/gpu:3g.40gb=8,gres/gpu:a100.80gb=8|
dept2|1|00:00:00|||cluster|||1.00|cpu=256,gres/gpu:1g.10gb=0,gres/gpu:2g.20gb=0,gres/gpu:3g.40gb=0,gres/gpu:a100.80gb=16|


So the dept1 and dept2 QOSes are set on the same partition. How can a user with
access to one or the other see whether there are available resources in the
partition?

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Could not find group with gid even when they exist

2024-02-06 Thread Nic Lewis via slurm-users
After upgrading to version 23.11.3 we started to get slammed with the following 
log messages from slurmctld

"error: validate_group: Could not find group with gid "

This spans a handful of groups and repeats constantly, drowning out just about 
everything else. Looking up the groups shows that they exist on the scheduler 
node, and the same holds for all the submission and compute nodes. As far as I 
can tell, slurm should be able to locate the groups in question.

Jobs submitted from users within those groups go through just fine. They get 
scheduled, run, and clean up no problem. I'm at a loss on where to look next.

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com