Re: [slurm-users] Job continuing to use cpu minutes after completion

2023-02-03 Thread Jonathan Casco
Hi Reed, Thank you for that information. I gave the requeue a try; however, it did not work, as the scheduler did not recognize the job ID:
# scontrol requeue 8853669_3
8853669_3: Invalid job id specified
I tried with a few other job steps but saw the same error. It looks like the scheduler is not

Re: [slurm-users] Job continuing to use cpu minutes after completion

2023-02-03 Thread Reed Dier
This sounds similar to something I recently experienced and finally figured out in 21.08. https://lists.schedmd.com/pipermail/slurm-users/2023-January/009594.html The long and short of it is that I had jobs with the clo

Re: [slurm-users] GPU: MPS vs Sharding

2023-02-03 Thread EPF (Esben Peter Friis)
MPS only works for the first GPU in a system. If you have a server with multiple GPUs, you can only share the first GPU between multiple jobs. Sharding, on the other hand, works for all GPUs in the system. Note that sharding is soft: Slurm will not monitor the actual GPU use, so jobs will have to

[slurm-users] Job continuing to use cpu minutes after completion

2023-02-03 Thread Jonathan Casco
Hello, We are using Slurm 22.05.6 and have encountered a strange issue with one user's jobs, which were submitted as a job array. According to the logs, the jobs failed and left the queue, but they have continued to use CPU minutes well past job completion. I am using one step as an example here but this is occu

[slurm-users] Advanced reservations inc memory

2023-02-03 Thread Will Furnass
Hi all, It seems that memory can't be reserved as part of advanced reservations:
$ sudo -i scontrol create reservationname=testres1 start=now duration=259200 account=testacct tres=node=1,cpu=20,mem=96G
scontrol: error: TRES type 'mem' not supported with reservations
$ srun --version
slurm 22.05.6
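
The error above suggests that scontrol only accepts certain TRES types for reservations. As a rough illustration, a TRES string could be checked before calling scontrol. This is a minimal sketch, not Slurm's own validation: the helper names are hypothetical, and the allow-list is an assumption that should be checked against the scontrol man page for the installed version.

```python
# ASSUMPTION: allow-list of TRES types accepted for reservations.
# This is not taken from the post; verify it with `man scontrol`.
ASSUMED_RESERVATION_TRES = {"node", "cpu", "license", "bb"}


def parse_tres(spec):
    """Split a string like 'node=1,cpu=20,mem=96G' into a name -> value dict."""
    result = {}
    for item in spec.split(","):
        name, sep, value = item.partition("=")
        if not sep or not name.strip() or not value.strip():
            raise ValueError(f"malformed TRES entry: {item!r}")
        result[name.strip().lower()] = value.strip()
    return result


def unsupported_tres(spec, allowed=ASSUMED_RESERVATION_TRES):
    """Return the sorted TRES names whose base type is not in `allowed`."""
    # A name may carry a subtype (e.g. 'license/foo'); compare the base type only.
    return sorted(
        name for name in parse_tres(spec) if name.split("/")[0] not in allowed
    )


if __name__ == "__main__":
    # The TRES string from the failing command in the post.
    print(unsupported_tres("node=1,cpu=20,mem=96G"))
```

With the string from the failing command, only `mem` is flagged, which matches the type named in the scontrol error.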

Re: [slurm-users] Enforce gpu usage limits (with GRES?)

2023-02-03 Thread Markus Kötter
Hi, limits ain't easy. https://support.ceci-hpc.be/doc/_contents/SubmittingJobs/SlurmLimits.html#precedence I think there are multiple options, starting with not having GPU resources in the CPU partition. Or creating a QOS for the partition and having MaxTRES=gres/gpu:A100=0,gres/gpu:K80=0,gres/g