Hello everybody, I have a problem with my GPU nodes which I believe could be rather special. This is Slurm 23.02.2.
I tried to add new GRES to the nodes, which did not work at first. After quite some prodding all GRES are present, but now the GPU-to-CPU affinity (?) — or at least the scontrol GRES definition — of one node looks odd. There are four GPU nodes, and one of them (gpu001, on which I did some trial-and-error experiments) is now different from the others:

$ scontrol show node=gpu001 | grep Gres
   Gres=gpu:a100:4,localdisk:490,ramdisk:100
$ scontrol show node=gpu002 | grep Gres
   Gres=gpu:a100:4(S:0-1),localdisk:490,ramdisk:100

Notice the missing socket indicator (S:0-1) in the output for gpu001.

Running slurmctld with DebugFlags=Gres shows further (abbreviated):

[2023-05-21T14:49:50.503] gres/gpu: state for gpu001
[2023-05-21T14:49:50.503]   gres_cnt found:4 configured:4 avail:4 alloc:0
[2023-05-21T14:49:50.503]   gres_bit_alloc: of 4
[2023-05-21T14:49:50.503]   gres_used:(null)
[2023-05-21T14:49:50.503]   topo[0]:a100(808464737)
[2023-05-21T14:49:50.503]   topo_core_bitmap[0]:NULL

while for the other GPU nodes:

[2023-05-21T14:49:50.506] gres/gpu: state for gpu002
[2023-05-21T14:49:50.506]   gres_cnt found:4 configured:4 avail:4 alloc:0
[2023-05-21T14:49:50.506]   gres_bit_alloc: of 4
[2023-05-21T14:49:50.506]   gres_used:(null)
[2023-05-21T14:49:50.506]   topo[0]:a100(808464737)
[2023-05-21T14:49:50.506]   topo_core_bitmap[0]:36-47 of 48

Notice the different topo_core_bitmap field (NULL on gpu001 versus 36-47 of 48 on gpu002).

The nodes are stateless (ramdisk based) and have been rebooted; slurmctld, slurmdbd, and the slurmd daemons have been restarted multiple times, and "scontrol reconfigure" has been executed multiple times. I have no clear idea how they ended up in this state, and nothing I try appears to be able to "synchronize" them again. In particular, changing the Cores= values in gres.conf changes neither topo_core_bitmap field, although I would naively expect it to. It is as if the Cores= values are ignored. Note that AutoDetect should not work on our system, since Slurm has not been built with NVML support.

Any ideas?
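For quickly spotting which nodes have lost the socket indicator, a small shell sketch like the following may help (the node loop is commented out and assumes scontrol is in PATH; the parsing function is exercised here against the two sample Gres lines above):

```shell
#!/usr/bin/env bash
# Extract the socket indicator (e.g. "S:0-1") from a Gres= line;
# prints "MISSING" when no "(S:...)" suffix is present.
socket_of() {
  local line="$1"
  if [[ "$line" =~ \(S:([0-9-]+)\) ]]; then
    printf 'S:%s\n' "${BASH_REMATCH[1]}"
  else
    printf 'MISSING\n'
  fi
}

# Hypothetical usage on a live cluster (node names from this post):
# for n in gpu00{1..4}; do
#   printf '%s: %s\n' "$n" "$(socket_of "$(scontrol show node="$n" | grep Gres=)")"
# done

# Sample lines copied verbatim from the scontrol output above:
socket_of "Gres=gpu:a100:4,localdisk:490,ramdisk:100"        # gpu001
socket_of "Gres=gpu:a100:4(S:0-1),localdisk:490,ramdisk:100" # gpu002
```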
Also, please let me know if further information is needed.

Our gres.conf:

NodeName=gpu[001-004] Name=gpu Type=a100 File=/dev/nvidia0 Cores=0-11
NodeName=gpu[001-004] Name=gpu Type=a100 File=/dev/nvidia1 Cores=12-23
NodeName=gpu[001-004] Name=gpu Type=a100 File=/dev/nvidia2 Cores=24-35
NodeName=gpu[001-004] Name=gpu Type=a100 File=/dev/nvidia3 Cores=36-43
NodeName=gpu[001-004] Name=localdisk Flags=CountOnly
NodeName=gpu[001-004] Name=ramdisk Flags=CountOnly

And an extract from slurm.conf:

# grep gpu slurm.conf
GresTypes=gpu,localdisk,ramdisk
AccountingStorageTRES=gres/gpu,gres/gpu:a100,gres/localdisk,gres/ramdisk
NodeName=gpu[001-004] Boards=1 SocketsPerBoard=2 CoresPerSocket=24 ThreadsPerCore=1 RealMemory=500000 Feature=cpu6342,ram512,gpuA100 Gres=gpu:a100:4,localdisk:490,ramdisk:100
PartitionName=gpu Nodes=gpu[001-004] MaxTime=72:00:00 OverTimeLimit=5 DefMemPerCPU=4096 MaxMemPerCPU=7500 State=UP

Best Regards

Christof

--
Dr. rer. nat. Christof Köhler       email: c.koeh...@uni-bremen.de
Universitaet Bremen/FB1/BCCMS       phone: +49-(0)421-218-62334
Am Fallturm 1/ TAB/ Raum 3.06       fax: +49-(0)421-218-62770
28359 Bremen
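One observation on the gres.conf above, offered only as a sanity check and not as a diagnosis: the node definition gives 2 sockets x 24 cores = 48 cores, and the debug output for gpu002 shows topo_core_bitmap 36-47, yet the last GPU's line uses Cores=36-43 (8 cores, not 12). A minimal sketch checking the ranges copied from the config against the 48-core total:

```shell
#!/usr/bin/env bash
# Sanity-check the Cores= ranges from the gres.conf above against the
# node's core count (Boards=1 * SocketsPerBoard=2 * CoresPerSocket=24 = 48).
TOTAL_CORES=48

check_range() {  # e.g. check_range "36-43"
  local lo="${1%-*}" hi="${1#*-}"
  if (( hi >= TOTAL_CORES )); then
    printf '%s: exceeds core count %d\n' "$1" "$TOTAL_CORES"
  else
    printf '%s: %d cores\n' "$1" "$(( hi - lo + 1 ))"
  fi
}

# Ranges copied from the gres.conf lines above:
for r in 0-11 12-23 24-35 36-43; do check_range "$r"; done
```

Whether the shorter last range is intentional I cannot say, but it is worth ruling out while comparing the nodes.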