[slurm-users] How to check the bench mark capacity of the SLURM setup

2023-12-12 Thread John Joseph
Dear All, good morning. We have set up Slurm on a 4-node HPC cluster and want to run a stress test. Could you point us to code that can test the functionality and efficiency of our Slurm setup? If such a program exists, we would like to try it out. Guidance requested. Thanks, Joseph John
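One common way to exercise the scheduler itself (rather than the compute nodes) is to flood it with many short jobs and watch how quickly they are accepted and started. Below is a minimal Python sketch of that idea. It assumes `sbatch` is on the PATH; the job names, time limit and sleep length are illustrative choices, not recommendations. It defaults to a dry run that only prints the commands it would submit.

```python
import subprocess
import time


def build_sbatch_cmd(index: int, sleep_seconds: int = 1) -> list[str]:
    """Build an sbatch command for one trivial test job (values are illustrative)."""
    return [
        "sbatch",
        "--parsable",                  # print only the job ID (or "jobid;cluster")
        f"--job-name=stress_{index}",  # hypothetical naming scheme
        "--ntasks=1",
        "--time=00:05:00",
        "--output=/dev/null",
        f"--wrap=sleep {sleep_seconds}",
    ]


def submit_many(n_jobs: int, sleep_seconds: int = 1, dry_run: bool = True) -> list[str]:
    """Submit n_jobs trivial jobs; in dry-run mode return the commands instead."""
    results = []
    start = time.monotonic()
    for i in range(n_jobs):
        cmd = build_sbatch_cmd(i, sleep_seconds)
        if dry_run:
            # Only record the command line; nothing is sent to Slurm.
            results.append(" ".join(cmd))
            continue
        out = subprocess.run(cmd, capture_output=True, text=True, check=True)
        # --parsable output is "jobid" or "jobid;cluster"; keep the job ID part.
        results.append(out.stdout.strip().split(";")[0])
    elapsed = time.monotonic() - start
    if not dry_run:
        print(f"submitted {n_jobs} jobs in {elapsed:.1f} s")
    return results


if __name__ == "__main__":
    for line in submit_many(3):
        print(line)
```

Calling `submit_many(500, dry_run=False)` would submit real jobs, so start small. While such a test runs, `sdiag` reports scheduler statistics (for example, scheduling cycle times) that show how the controller copes with the load.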

Re: [slurm-users] [External] Re: Troubleshooting job stuck in Pending state

2023-12-12 Thread Bernstein, Noam CIV USN NRL (6393) Washington DC (USA)
Presumably what's in the squeue Reason column isn't enough? It's not particularly informative, although it does distinguish "Resources" from "Priority", for example, and it'll also list various partition limits, e.g.
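To see which reasons dominate across the whole queue rather than reading them job by job, the Reason column can be tallied. Below is a minimal Python sketch, assuming `squeue` is on the PATH. `%i` and `%r` are squeue format fields for job ID and reason, and the helper names are hypothetical. The sample input in the `__main__` block is made up for illustration.

```python
import subprocess
from collections import Counter

# -h: no header; -t PD: pending jobs only; "%i %r": job ID and reason
SQUEUE_CMD = ["squeue", "-h", "-t", "PD", "-o", "%i %r"]


def count_reasons(squeue_output: str) -> Counter:
    """Tally the Reason column from `squeue -h -o "%i %r"` output."""
    counts = Counter()
    for line in squeue_output.splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) == 2:
            counts[parts[1].strip()] += 1
    return counts


def pending_reasons() -> Counter:
    """Run squeue on the live system and tally pending reasons."""
    out = subprocess.run(SQUEUE_CMD, capture_output=True, text=True, check=True)
    return count_reasons(out.stdout)


if __name__ == "__main__":
    sample = "101 Priority\n102 Resources\n103 Priority\n"
    print(count_reasons(sample).most_common())
```

A queue dominated by "Priority" rather than "Resources" suggests that jobs are waiting their turn, not that the cluster is full.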

Re: [slurm-users] [External] Re: Troubleshooting job stuck in Pending state

2023-12-12 Thread Davide DelVento
I am not a Slurm expert by any stretch of the imagination, so my answer is not authoritative. That said, I am not aware of any functional equivalent in Slurm, and I would love to learn that I am mistaken! On Tue, Dec 12, 2023 at 1:39 AM Pacey, Mike wrote: > Hi Davide, > The jobs do event

[slurm-users] Slurm doesn't allocate job on available MIGs

2023-12-12 Thread Tristan Gillard
Hello, we have a problem on a DGX where the 4 A100s are split into different MIGs (Multi-Instance GPUs). We use Slurm to allocate jobs on partitions grouping MIGs according to their size:
- prod10 for 10 x 1g10gb
- prod20 for 4 x 2g20gb
- prod40 for 1 x 3g40gb
- prod80 for 1 x A100g80gb
The pr

Re: [slurm-users] [External] Re: Troubleshooting job stuck in Pending state

2023-12-12 Thread Pacey, Mike
Hi Davide, The jobs do eventually run, but they can take several minutes, or sometimes several hours, to switch to a running state even when there are plenty of free resources immediately. With Grid Engine it was possible to turn on scheduling diagnostics and get a summary of the scheduler’s decisions
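To put numbers on "several minutes or sometimes several hours", one option is to measure how long each pending job has waited since submission. This does not replace Grid Engine-style scheduling diagnostics. Below is a minimal Python sketch, assuming `squeue` is on the PATH and prints `%V` (submission time) in the default ISO-like format such as `2023-12-12T01:39:00`. The `SLURM_TIME_FORMAT` environment variable can change that format. The helper names are hypothetical.

```python
import subprocess
from datetime import datetime

# -h: no header; -t PD: pending only; "%i %V": job ID and submission time
SQUEUE_CMD = ["squeue", "-h", "-t", "PD", "-o", "%i %V"]


def pending_durations(squeue_output: str, now: datetime) -> dict[str, float]:
    """Map job ID -> minutes spent pending, from `squeue -o "%i %V"` output."""
    waits = {}
    for line in squeue_output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue
        job_id, submitted = parts
        # Assumes the default timestamp format, e.g. 2023-12-12T01:39:00
        submit_time = datetime.strptime(submitted, "%Y-%m-%dT%H:%M:%S")
        waits[job_id] = (now - submit_time).total_seconds() / 60.0
    return waits


def current_waits() -> dict[str, float]:
    """Query the live queue and compute pending minutes per job."""
    out = subprocess.run(SQUEUE_CMD, capture_output=True, text=True, check=True)
    return pending_durations(out.stdout, datetime.now())


if __name__ == "__main__":
    sample = "42 2023-12-12T01:00:00\n43 2023-12-12T01:20:00\n"
    print(pending_durations(sample, datetime(2023, 12, 12, 1, 30)))
```

Logging these waits periodically alongside the squeue Reason column makes it easier to check whether long waits line up with "Priority" or "Resources".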