Dear All, Good morning. We have set up Slurm on an HPC cluster of 4 nodes.
We want to run a stress test and would like guidance on a code that can
test the functionality and efficiency of Slurm. If there is such a program,
we would like to try it out. Guidance requested. Thanks, Joseph John
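One rough way to exercise the scheduler, offered only as a sketch and not as an established Slurm benchmark, is to submit a job array of many short `sleep` jobs and watch how quickly they start and finish. The helper names `build_stress_command` and `submit`, the job name `stress`, and the task count and sleep time are all illustrative assumptions. The `sbatch` options used (`--array`, `--ntasks`, `--job-name`, `--partition`, `--wrap`) are standard ones. Note that the array size is capped by the site's `MaxArraySize` setting.

```python
import shlex
import subprocess


def build_stress_command(n_tasks, sleep_seconds=10, partition=None):
    """Build an sbatch argv that submits a job array of short sleep jobs.

    Hypothetical helper: the job name and defaults are illustrative only.
    """
    if n_tasks < 1:
        raise ValueError("n_tasks must be at least 1")
    cmd = ["sbatch", f"--array=1-{n_tasks}", "--ntasks=1", "--job-name=stress"]
    if partition:
        cmd.append(f"--partition={partition}")
    # --wrap lets sbatch run a one-line command without a batch script file.
    cmd.append(f"--wrap=sleep {sleep_seconds}")
    return cmd


def submit(cmd):
    """Run the sbatch command and return whatever sbatch prints on stdout."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout.strip()


if __name__ == "__main__":
    cmd = build_stress_command(500, sleep_seconds=10)
    # Print the command only; call submit(cmd) on a host where sbatch exists.
    print(shlex.join(cmd))
```

This only loads the scheduler with trivial jobs. It says nothing about compute, network, or storage performance, for which a separate benchmark would be needed.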
Presumably what's in the squeue Reason column isn't enough? It's not
particularly informative, although it does distinguish "Resources" from
"Priority", for example, and it'll also list various partition limits, e.g.
I am not a Slurm expert by any stretch of the imagination, so my answer is
not authoritative.
That said, I am not aware of any functional equivalent for Slurm, and I
would love to learn that I am mistaken!
On Tue, Dec 12, 2023 at 1:39 AM Pacey, Mike wrote:
> Hi Davide,
>
>
>
> The jobs do event
Hello,
we have a problem on a DGX where the 4 A100s are split into different MIGs
(Multi-Instance GPUs).
We use Slurm to allocate jobs on partitions grouping MIGs according to their
size:
- prod10 for 10 x 1g10gb
- prod20 for 4 x 2g20gb
- prod40 for 1 x 3g40gb
- prod80 for 1 x A100g80gb
The pr
Hi Davide,
The jobs do eventually run, but can take several minutes or sometimes several
hours to switch to a running state even when there’s plenty of resources free
immediately.
With Grid Engine it was possible to turn on scheduling diagnostics and get a
summary of the scheduler’s decisions
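As a partial stand-in for that kind of summary, the squeue Reason column mentioned earlier in the thread can at least be tallied across pending jobs. The sketch below is an assumption-laden illustration, not a replacement for scheduler diagnostics: `tally_reasons` and `pending_reasons` are hypothetical helper names, and the sample job IDs are made up. The squeue format fields `%i` (job ID), `%P` (partition) and `%r` (reason) are standard.

```python
import subprocess
from collections import Counter

# Pending jobs only, no header, fields separated by "|".
SQUEUE_CMD = ["squeue", "--noheader", "--states=PENDING", "--format=%i|%P|%r"]


def tally_reasons(squeue_output):
    """Count pending jobs per Reason from 'jobid|partition|reason' lines."""
    counts = Counter()
    for line in squeue_output.splitlines():
        parts = line.strip().split("|", 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        _job_id, _partition, reason = parts
        counts[reason.strip()] += 1
    return counts


def pending_reasons():
    """Run squeue (must be on PATH) and tally the reasons it reports."""
    result = subprocess.run(SQUEUE_CMD, check=True, capture_output=True, text=True)
    return tally_reasons(result.stdout)


if __name__ == "__main__":
    # Made-up sample so the script runs without a Slurm installation.
    sample = "101|prod10|Resources\n102|prod10|Priority\n103|prod20|Priority\n"
    for reason, count in tally_reasons(sample).most_common():
        print(f"{reason}: {count}")
```

On a live cluster, replace the sample with a call to `pending_reasons()`. This reports only what Slurm already exposes per job and does not explain why the scheduler made a given decision.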