Hello,
We are trying to run some PIConGPU codes on a machine with 8x H100 GPUs,
using Slurm, but the jobs don't run and are not in the queue. In the
slurmd logs I have:
[2024-10-24T09:50:40.934] CPU_BIND: _set_batch_job_limits: Memory
extracted from credential for StepId=1079.batch job_mem_limit= 64
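
For context, the submission looks roughly like the sketch below; the
partition name, resource counts, and the picongpu arguments are
illustrative placeholders rather than our exact script:

    #!/bin/bash
    #SBATCH --job-name=picongpu-test
    #SBATCH --partition=gpu        # placeholder partition name
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=8    # one MPI rank per GPU
    #SBATCH --gres=gpu:8           # all 8 H100s on the node
    #SBATCH --mem=64G              # explicit memory request (the cluster default applies if omitted)
    #SBATCH --time=01:00:00

    # picongpu arguments below are illustrative only
    srun ./bin/picongpu -d 2 2 2 -g 256 256 256 -s 1000
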
We have a 'gpu' partition with 30 or so nodes, some with A100s, some with
H100s, and a few others.
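Jobs normally pick a GPU type with a typed gres request, roughly like the
following (the type names a100 and h100 are assumed to match our gres.conf):

    # jobs asking specifically for A100s
    sbatch --partition=gpu --gres=gpu:a100:4 job_a100.sh
    # a later job asking for H100s on the same partition
    sbatch --partition=gpu --gres=gpu:h100:4 job_h100.sh
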
It appears that when (for example) all of the A100 GPUs are in use, and
there are additional jobs pending that request A100 GPUs, and those jobs
have the highest priority in the partition, then jobs subm