[slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-25 Thread Guillaume Perrault Archambault
Hello, I wrote a regression-testing toolkit to manage large numbers of SLURM jobs and their output (the toolkit can be found here if anyone is interested). To make job launching faster, sbatch commands are forked, so that numerous jobs may be

Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-27 Thread Guillaume Perrault Archambault
> problem. > > -Paul Edmon- > On 8/26/19 2:12 AM, Guillaume Perrault Archambault wrote: > > Hello, > > I wrote a regression-testing toolkit to manage large numbers of SLURM jobs > and their output (the toolkit can be found here > <https://github.com/gobbedy/slurm_si

Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-27 Thread Guillaume Perrault Archambault
possible). > > You should also read the Large Cluster Administration Guide at > https://slurm.schedmd.com/big_sys.html > > Furthermore, it may perhaps be a good idea to have the MySQL database > server installed on a separate server so that it doesn't slow down the > slurmctld.

Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-27 Thread Guillaume Perrault Archambault
s which then the scheduler can > handle sensibly. So I highly recommend using job arrays. > > -Paul Edmon- > On 8/27/19 3:45 AM, Guillaume Perrault Archambault wrote: > > Hi Paul, > > Thanks a lot for your suggestion. > > The cluster I'm using has thousands of users,

Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-27 Thread Guillaume Perrault Archambault
want to look into slurmdbd and sacct > > Then you can create a qos that has MaxJobsPerUser to limit the total > number running on a per-user basis: > https://slurm.schedmd.com/resource_limits.html > > Brian Andrus > On 8/27/2019 9:38 AM, Guillaume Perrault Archambault wrote: >
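
A minimal sketch of the QOS approach suggested above, assuming an administrator runs the commands and that accounting through slurmdbd is already set up. The QOS name `regression`, the limit of 50, and the script name are hypothetical placeholders; the helper only builds the command lines and does not execute anything.

```python
import shlex

def qos_limit_commands(qos_name, max_jobs_per_user):
    """Return sacctmgr command lines (as argv lists) that create a QOS
    and cap the number of jobs each user may run under it.

    qos_name and max_jobs_per_user are illustrative values chosen by
    the administrator; nothing here is specific to any one cluster.
    """
    if max_jobs_per_user < 1:
        raise ValueError("max_jobs_per_user must be at least 1")
    return [
        ["sacctmgr", "add", "qos", qos_name],
        ["sacctmgr", "modify", "qos", qos_name, "set",
         f"MaxJobsPerUser={max_jobs_per_user}"],
    ]

def submit_with_qos(script, qos_name):
    """Return the sbatch argv a user would run to submit under that QOS."""
    return ["sbatch", f"--qos={qos_name}", script]

if __name__ == "__main__":
    for argv in qos_limit_commands("regression", 50):
        print(shlex.join(argv))
    print(shlex.join(submit_with_qos("run_test.sh", "regression")))
```

The user's association must also be granted access to the QOS before `--qos` is accepted, which is again an administrator-side step.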

Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-29 Thread Guillaume Perrault Archambault
nistrator can create a QOS that explicitly limits the user. > However, you keep saying that they probably won't modify the system > for just you... > > That seems to me to be the perfect case to use array jobs and tell it > how many elements of the array to run at once. > You
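
The array-job suggestion above can be sketched as follows. sbatch accepts a `%` suffix on the `--array` range to cap how many array elements run simultaneously; the helper name, the script name, and the numbers are hypothetical, and the function only constructs the command rather than submitting it.

```python
import shlex

def array_sbatch_command(script, num_tasks, max_concurrent):
    """Build an sbatch argv that submits num_tasks array elements
    (indices 0..num_tasks-1) with at most max_concurrent running at once.
    """
    if num_tasks < 1 or max_concurrent < 1:
        raise ValueError("num_tasks and max_concurrent must be at least 1")
    # "0-499%20" means indices 0 through 499, at most 20 running together.
    array_spec = f"0-{num_tasks - 1}%{max_concurrent}"
    return ["sbatch", f"--array={array_spec}", script]

if __name__ == "__main__":
    print(shlex.join(array_sbatch_command("run_test.sh", 500, 20)))
```

Inside the submitted script, each element can pick its own test case from the `SLURM_ARRAY_TASK_ID` environment variable. This replaces many forked sbatch calls with a single submission, and it needs no change to the cluster configuration.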

Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-30 Thread Guillaume Perrault Archambault
> picture view of things (I've never been an admin, most notably), so feel > free to poke holes at the way I've constructed things. > > > > Regards, > > Guillaume. > > > > > > On Fri, Aug 30, 2019 at 1:22 AM Steven Dick wrote: > >> >

Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-30 Thread Guillaume Perrault Archambault
limit (e.g., > > maxtrespu=gpu=40 ) Then the user would assign that QOS to the job when > > starting it to set the overall allocation for all the jobs. The admin > > wouldn't need to tweak this except once, you just pick which tweak to > > use. > > > >

Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-30 Thread Guillaume Perrault Archambault
ic. > > -Paul Edmon- > On 8/30/19 2:58 PM, Guillaume Perrault Archambault wrote: > > Hi Paul, > > Thanks for your pointers. > > I'll look into QOS and MCS after my paper deadline (Sept 5). Re QOS, as > expressed to Peter in the reply I just now sent, I wonder i

Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-31 Thread Guillaume Perrault Archambault
Hi Steven, Thanks for your help. Looks like QOS is the way to go if I want both job arrays + user limits on jobs/resources (in the context of a regression-test). Regards, Guillaume. On Fri, Aug 30, 2019 at 6:11 PM Steven Dick wrote: > On Fri, Aug 30, 2019 at 2:58 PM Guillaume Perra