Hello,
We are trying to run some PIConGPU codes on a machine with 8x H100 GPUs,
using Slurm, but the jobs don't run and are not in the queue. In the
slurmd logs I have:
[2024-10-24T09:50:40.934] CPU_BIND: _set_batch_job_limits: Memory
extracted from credential for StepId=1079.batch job_mem_limit= 64
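
For context, the submission looks roughly like the sketch below; the
partition name, resource counts, and the picongpu arguments are
illustrative placeholders rather than our exact script:

    #!/bin/bash
    #SBATCH --job-name=picongpu-test
    #SBATCH --partition=gpu        # placeholder partition name
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=8    # one MPI rank per GPU
    #SBATCH --gres=gpu:8           # all 8 H100s on the node
    #SBATCH --mem=64G              # explicit memory request (the cluster default applies if omitted)
    #SBATCH --time=01:00:00

    # picongpu arguments below are illustrative only
    srun ./bin/picongpu -d 2 2 2 -g 256 256 256 -s 1000
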
We have a 'gpu' partition with 30 or so nodes, some with A100s, some with
H100s, and a few others.
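Jobs normally pick a GPU type with a typed gres request, roughly like the
following (the type names a100 and h100 are assumed to match our gres.conf):

    # jobs asking specifically for A100s
    sbatch --partition=gpu --gres=gpu:a100:4 job_a100.sh
    # a later job asking for H100s on the same partition
    sbatch --partition=gpu --gres=gpu:h100:4 job_h100.sh
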
It appears that when (for example) all of the A100 GPUs are in use, and
there are additional jobs pending that request A100 GPUs, and those jobs
have the highest priority in the partition, then jobs subm