Another update: Sorry, my bad! This is already part of the Gres documentation:

"""
For Type to match a system-detected device, it must either exactly match or be a substring of the GPU name reported by slurmd via the AutoDetect mechanism. This GPU name will have all spaces replaced with underscores. To see the GPU name, set SlurmdDebug=debug2 in your slurm.conf and either restart or reconfigure slurmd and check the slurmd log.
"""

The only thing that is still not clear to me is that it also doesn't work if I remove the AutoDetect=nvml line from gres.conf.

Cheers, and have a nice weekend

Esben

________________________________
From: EPF (Esben Peter Friis) <e...@novozymes.com>
Sent: Thursday, January 5, 2023 17:14
To: slurm-users@lists.schedmd.com <slurm-users@lists.schedmd.com>
Subject: Re: Sharding not working correctly if several gpu types are defined

Update:

If I call the smaller card "Quadro" rather than "RTX5000", it works correctly.

In slurm.conf:

NodeName=koala NodeAddr=10.194.132.190 Boards=1 SocketsPerBoard=2 CoresPerSocket=14 ThreadsPerCore=2 Gres=gpu:A5000:3,gpu:Quadro:1,shard:88 Feature=gpu,ht

and in gres.conf:

AutoDetect=nvml
Name=gpu Type=A5000 File=/dev/nvidia0
Name=gpu Type=A5000 File=/dev/nvidia1
Name=gpu Type=Quadro File=/dev/nvidia2
Name=gpu Type=A5000 File=/dev/nvidia3
Name=shard Count=24 File=/dev/nvidia0
Name=shard Count=24 File=/dev/nvidia1
Name=shard Count=16 File=/dev/nvidia2
Name=shard Count=24 File=/dev/nvidia3

Does the name string have to be (part of) what nvidia-smi or NVML reports?

Cheers,

Esben

________________________________
From: EPF (Esben Peter Friis) <e...@novozymes.com>
Sent: Thursday, January 5, 2023 16:51
To: slurm-users@lists.schedmd.com <slurm-users@lists.schedmd.com>
Subject: Sharding not working correctly if several gpu types are defined

Really great that there is now a way to share GPUs between several jobs - even with several GPUs per host. Thanks for adding this feature!

I have compiled (against CUDA 11.8) and installed 22.05.7. The test system is one host with 4 GPUs (3 x Nvidia A5000 + 1 x Nvidia RTX5000).

nvidia-smi reports this:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13    Driver Version: 525.60.13    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A5000    On   | 00000000:02:00.0 Off |                  Off |
| 42%   62C    P2    88W / 230W |    207MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A5000    On   | 00000000:03:00.0 Off |                  Off |
| 45%   61C    P5    80W / 230W |      3MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Quadro RTX 5000     On   | 00000000:83:00.0 Off |                  Off |
| 51%   63C    P0    67W / 230W |      3MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA RTX A5000    On   | 00000000:84:00.0 Off |                  Off |
| 31%   52C    P0    64W / 230W |      3MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
My gres.conf is this. The RTX5000 has less memory, so we configure it with fewer shards:

AutoDetect=nvml
Name=gpu Type=A5000 File=/dev/nvidia0
Name=gpu Type=A5000 File=/dev/nvidia1
Name=gpu Type=RTX5000 File=/dev/nvidia2
Name=gpu Type=A5000 File=/dev/nvidia3
Name=shard Count=24 File=/dev/nvidia0
Name=shard Count=24 File=/dev/nvidia1
Name=shard Count=16 File=/dev/nvidia2
Name=shard Count=24 File=/dev/nvidia3

If I don't configure the GPUs by type - like this in slurm.conf:

NodeName=koala NodeAddr=10.194.132.190 Boards=1 SocketsPerBoard=2 CoresPerSocket=14 ThreadsPerCore=2 Gres=gpu:4,shard:88 Feature=gpu,ht

and run 7 jobs, each requesting 12 shards, it works exactly as expected: two jobs on each of the A5000's and one job on the RTX5000. (Subsequent jobs requesting 12 shards are correctly queued.)

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     904552     C   ...ing_proj/.venv/bin/python      204MiB |
|    0   N/A  N/A    1160663     C   ...-2020-ubuntu20.04/bin/gmx      260MiB |
|    0   N/A  N/A    1160758     C   ...-2020-ubuntu20.04/bin/gmx      254MiB |
|    1   N/A  N/A    1160643     C   ...-2020-ubuntu20.04/bin/gmx      262MiB |
|    1   N/A  N/A    1160647     C   ...-2020-ubuntu20.04/bin/gmx      256MiB |
|    2   N/A  N/A    1160659     C   ...-2020-ubuntu20.04/bin/gmx      174MiB |
|    3   N/A  N/A    1160644     C   ...-2020-ubuntu20.04/bin/gmx      248MiB |
|    3   N/A  N/A    1160755     C   ...-2020-ubuntu20.04/bin/gmx      260MiB |
+-----------------------------------------------------------------------------+

That's great! If we run jobs requiring one or more full GPUs, we would like to be able to request specific GPU types as well.
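(What we have in mind for those full-GPU jobs are requests along these lines - the wrapped command is just a placeholder, and these exact invocations are only a sketch, not something tested as part of this report:)

sbatch --gres=gpu:A5000:1 --wrap 'bash -c " ... (command goes here) ... "'
sbatch --gres=gpu:RTX5000:1 --wrap 'bash -c " ... (command goes here) ... "'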
But if I also configure the GPUs by type - like this in slurm.conf:

NodeName=koala NodeAddr=10.194.132.190 Boards=1 SocketsPerBoard=2 CoresPerSocket=14 ThreadsPerCore=2 Gres=gpu:A5000:3,gpu:RTX5000:1,shard:88 Feature=gpu,ht

and run 7 jobs, each requesting 12 shards, it does NOT work. It starts two jobs on each of the first two A5000's, two jobs on the RTX5000, and one job on the last A5000. Strangely, it still knows that it should not start more jobs - subsequent jobs are still queued.

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     904552     C   ...ing_proj/.venv/bin/python      204MiB |
|    0   N/A  N/A    1176564     C   ...-2020-ubuntu20.04/bin/gmx      258MiB |
|    0   N/A  N/A    1176565     C   ...-2020-ubuntu20.04/bin/gmx      258MiB |
|    1   N/A  N/A    1176562     C   ...-2020-ubuntu20.04/bin/gmx      258MiB |
|    1   N/A  N/A    1176566     C   ...-2020-ubuntu20.04/bin/gmx      258MiB |
|    2   N/A  N/A    1176560     C   ...-2020-ubuntu20.04/bin/gmx      172MiB |
|    2   N/A  N/A    1176561     C   ...-2020-ubuntu20.04/bin/gmx      172MiB |
|    3   N/A  N/A    1176563     C   ...-2020-ubuntu20.04/bin/gmx      258MiB |
+-----------------------------------------------------------------------------+

It is also strange that "scontrol show node" seems to list the shards correctly, even in this case:

NodeName=koala Arch=x86_64 CoresPerSocket=14
   CPUAlloc=0 CPUEfctv=56 CPUTot=56 CPULoad=22.16
   AvailableFeatures=gpu,ht
   ActiveFeatures=gpu,ht
   Gres=gpu:A5000:3(S:0-1),gpu:RTX5000:1(S:0-1),shard:A5000:72(S:0-1),shard:RTX5000:16(S:0-1)
   NodeAddr=10.194.132.190 NodeHostName=koala Version=22.05.7
   OS=Linux 5.4.0-135-generic #152-Ubuntu SMP Wed Nov 23 20:19:22 UTC 2022
   RealMemory=1 AllocMem=0 FreeMem=390036 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=urgent,high,medium,low
   BootTime=2023-01-03T12:37:17 SlurmdStartTime=2023-01-05T16:24:53
   LastBusyTime=2023-01-05T16:37:24
   CfgTRES=cpu=56,mem=1M,billing=56,gres/gpu=4,gres/shard=88
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0 ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

In all cases, my jobs are submitted with commands like this:

sbatch --gres=shard:12 --wrap 'bash -c " ... (command goes here) ... "'

The behavior is very consistent. I have played around with adding CUDA_DEVICE_ORDER=PCI_BUS_ID to the environment of slurmd and slurmctld, but it makes no difference.

Is this a bug or a feature?

Cheers,

Esben