Hi Quirin,

maybe you are hitting this GRES issue: https://bugs.schedmd.com/show_bug.cgi?id=12642#c27

--
Bas van der Vlies

> On 17 Oct 2021, at 16:32, Quirin Lohr <quirin.l...@in.tum.de> wrote:
>
> Hi,
>
> I just upgraded from 20.11 to 21.08.2.
>
> Now it seems the slurmd cannot handle my custom GRES.
> I have set VRAM of the GPUs as a custom GRES, to allow users to select a GPU
> with enough VRAM for their jobs.
>
> I defined the VRAM in gres.conf:
>
>> NodeName=node[1,7,9] Name=VRAM Count=24G Flags=CountOnly
>> NodeName=node[2-6] Name=VRAM Count=12G Flags=CountOnly
>> NodeName=node[8,10] Name=VRAM Count=16G Flags=CountOnly
>> NodeName=node[11-14] Name=VRAM Count=48G Flags=CountOnly
>
> and in slurm.conf:
>
>> AccountingStorageTRES=gres/gpu,gres/gpu:p6000,gres/gpu:titan,gres/VRAM,gres/gpu:rtx_5000,gres/gpu:rtx_6000,gres/gpu:rtx_8000,gres/gpu:rtx_a6000
>> GresTypes=gpu,VRAM
>> NodeName=node1 CPUs=24 Boards=1 SocketsPerBoard=2 CoresPerSocket=12
>> ThreadsPerCore=1 RealMemory=230000 Weight=30
>> Feature=CPU_GEN:SBEP,CPU_SKU=E5-26,p6000
>> Gres=gpu:p6000:4,VRAM:no_consume:24G
>> NodeName=node2 CPUs=36 Boards=1 SocketsPerBoard=2 CoresPerSocket=18
>> ThreadsPerCore=1 RealMemory=490000 Weight=20
>> Feature=CPU_GEN:SBEP,CPU_SKU=E5-26,titan
>> Gres=gpu:titan:7,VRAM:no_consume:12G
>> NodeName=node3 CPUs=36 Boards=1 SocketsPerBoard=2 CoresPerSocket=18
>> ThreadsPerCore=1 RealMemory=490000 Weight=21
>> Feature=CPU_GEN:SBEP,CPU_SKU=E5-26,titan
>> Gres=gpu:titan:8,VRAM:no_consume:12G
>> NodeName=node4 CPUs=36 Boards=1 SocketsPerBoard=2 CoresPerSocket=18
>> ThreadsPerCore=1 RealMemory=490000 Weight=22
>> Feature=CPU_GEN:SBEP,CPU_SKU=E5-26,titan
>> Gres=gpu:titan:8,VRAM:no_consume:12G
>> NodeName=node5 CPUs=36 Boards=1 SocketsPerBoard=2 CoresPerSocket=18
>> ThreadsPerCore=1 RealMemory=490000 Weight=23
>> Feature=CPU_GEN:SBEP,CPU_SKU=E5-26,titan
>> Gres=gpu:titan:8,VRAM:no_consume:12G
>> NodeName=node6 CPUs=36 Boards=1 SocketsPerBoard=2 CoresPerSocket=18
>> ThreadsPerCore=1 RealMemory=490000 Weight=24
>> Feature=CPU_GEN:SBEP,CPU_SKU=E5-26,titan
>> Gres=gpu:titan:8,VRAM:no_consume:12G
>> NodeName=node7 CPUs=36 Boards=1 SocketsPerBoard=2 CoresPerSocket=18
>> ThreadsPerCore=1 RealMemory=490000 Weight=31
>> Feature=CPU_GEN:SBEP,CPU_SKU=E5-26,p6000
>> Gres=gpu:p6000:8,VRAM:no_consume:24G
>> NodeName=node8 CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=20
>> ThreadsPerCore=1 RealMemory=360000 Weight=40
>> Feature=CPU_GEN:SKYL,CPU_SKU=GOLD-61,rtx_5000
>> Gres=gpu:rtx_5000:9,VRAM:no_consume:16G
>> NodeName=node9 CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=20
>> ThreadsPerCore=1 RealMemory=360000 Weight=50
>> Feature=CPU_GEN:CL,CPU_SKU=GOLD-62,rtx_6000
>> Gres=gpu:rtx_6000:9,VRAM:no_consume:24G
>> NodeName=node10 CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=20
>> ThreadsPerCore=1 RealMemory=360000 Weight=41
>> Feature=CPU_GEN:CL,CPU_SKU=GOLD-62,rtx_5000
>> Gres=gpu:rtx_5000:9,VRAM:no_consume:16G
>> NodeName=node11 CPUs=36 Boards=1 SocketsPerBoard=2 CoresPerSocket=18
>> ThreadsPerCore=1 RealMemory=1500000 Weight=60
>> Feature=CPU_GEN:CL,CPU_SKU=GOLD-62,rtx_8000
>> Gres=gpu:rtx_8000:9,VRAM:no_consume:48G
>> NodeName=node12 CPUs=36 Boards=1 SocketsPerBoard=2 CoresPerSocket=18
>> ThreadsPerCore=1 RealMemory=1500000 Weight=61
>> Feature=CPU_GEN:CL,CPU_SKU=GOLD-62,rtx_8000
>> Gres=gpu:rtx_8000:9,VRAM:no_consume:48G
>> NodeName=node13 CPUs=36 Boards=1 SocketsPerBoard=2 CoresPerSocket=18
>> ThreadsPerCore=1 RealMemory=1500000 Weight=62
>> Feature=CPU_GEN:CL,CPU_SKU=GOLD-62,rtx_8000
>> Gres=gpu:rtx_8000:9,VRAM:no_consume:48G
>> NodeName=node14 CPUs=36 Boards=1 SocketsPerBoard=2 CoresPerSocket=18
>> ThreadsPerCore=1 RealMemory=1500000 Weight=63
>> Feature=CPU_GEN:CL,CPU_SKU=GOLD-62,rtx_a6000
>> Gres=gpu:rtx_a6000:8,VRAM:no_consume:48G
>
> If I want to run a job with only specifying --gpu=1 it gets executed on
> node2, if I add --gres=VRAM:32G it gets scheduled to node12, but then
> terminated with "Invalid generic resource (gres) specification".
>
> So I understand that the scheduler knows about the gres/VRAM, but the slurmd
> does not.
> Was there any change to this, and how can I get the old behaviour back?
>
> Thanks in advance
> Quirin Lohr
>
>> srun: defined options
>> srun: -------------------- --------------------
>> srun: gpus                : 1
>> srun: gres                : gres:VRAM:32G
>> srun: verbose             : 1
>> srun: -------------------- --------------------
>> srun: end of defined options
>> srun: Waiting for nodes to boot (delay looping 4650 times @ 0.100000 secs x
>> index)
>> srun: Nodes node12 are ready for job
>> srun: jobid 571261: nodes(1):`node12', cpu counts: 1(x1)
>> srun: error: Unable to create step for job 571261: Invalid generic resource
>> (gres) specification
>
> sacctmgr show tres:
>
>>     Type            Name     ID
>> -------- --------------- ------
>>      cpu                      1
>>      mem                      2
>>   energy                      3
>>     node                      4
>>  billing                      5
>>       fs            disk      6
>>     vmem                      7
>>    pages                      8
>>     gres             gpu   1001
>>     gres       gpu:p6000   1002
>>     gres     gpu:titanxp   1003
>>     gres            vram   1004
>>     gres gpu:titanxpasc+   1005
>>     gres       cudacores   1006
>>     gres     gpu:rtx5000   1007
>>     gres     gpu:rtx6000   1008
>>     gres             mps   1009
>>     gres     mps:rtx5000   1010
>>     gres     mps:rtx6000   1011
>>     gres     gpu:rtx8000   1012
>>     gres       gpu:titan   1013
>>     gres    gpu:rtx_5000   1014
>>     gres    gpu:rtx_6000   1015
>>     gres    gpu:rtx_8000   1016
>>     gres   gpu:rtx_a6000   1017
>
> --
> Quirin Lohr
> Systemadministration
> Technische Universität München
> Fakultät für Informatik
> Lehrstuhl für Bildverarbeitung und Künstliche Intelligenz
>
> Boltzmannstrasse 3
> 85748 Garching
>
> Tel. +49 89 289 17769
> Fax +49 89 289 17757
>
> quirin.l...@in.tum.de
> www.vision.in.tum.de
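
For reference, the placement on node12 is consistent with the quoted gres.conf: only the 48G nodes can satisfy a 32G request. The following is a minimal Python sketch of that comparison, not Slurm code. It assumes the K/M/G/T suffixes on Count are multiples of 1024 (as the gres.conf documentation describes), and the helper names (`parse_count`, `expand_nodes`, `vram_by_node`, `nodes_satisfying`) are hypothetical. It says nothing about why slurmd rejects the step; that is the subject of the linked bug.

```python
import re

# Assumption: suffixes multiply by powers of 1024, per the gres.conf Count docs.
SUFFIX = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}

# The four gres.conf lines quoted in the message.
GRES_CONF = """\
NodeName=node[1,7,9] Name=VRAM Count=24G Flags=CountOnly
NodeName=node[2-6] Name=VRAM Count=12G Flags=CountOnly
NodeName=node[8,10] Name=VRAM Count=16G Flags=CountOnly
NodeName=node[11-14] Name=VRAM Count=48G Flags=CountOnly
"""


def parse_count(text):
    """Turn a count such as '24G' into an integer (hypothetical helper)."""
    if text[-1].upper() in SUFFIX:
        return int(text[:-1]) * SUFFIX[text[-1].upper()]
    return int(text)


def expand_nodes(expr):
    """Expand 'node[1,7,9]' or 'node[2-6]'.

    Handles only this simple bracket form, not the full Slurm hostlist syntax.
    """
    m = re.fullmatch(r"(\w+?)\[([\d,\-]+)\]", expr)
    if not m:
        return [expr]
    prefix, body = m.groups()
    names = []
    for part in body.split(","):
        lo, _, hi = part.partition("-")
        for i in range(int(lo), int(hi or lo) + 1):
            names.append(f"{prefix}{i}")
    return names


def vram_by_node(conf_text):
    """Map each node name to its configured VRAM count."""
    table = {}
    for line in conf_text.splitlines():
        fields = dict(kv.split("=", 1) for kv in line.split())
        if fields.get("Name") != "VRAM":
            continue
        for node in expand_nodes(fields["NodeName"]):
            table[node] = parse_count(fields["Count"])
    return table


def nodes_satisfying(request, table):
    """Return nodes whose VRAM count is at least the requested amount."""
    need = parse_count(request)
    matches = [n for n, have in table.items() if have >= need]
    return sorted(matches, key=lambda n: int(re.sub(r"\D", "", n)))


table = vram_by_node(GRES_CONF)
print(nodes_satisfying("32G", table))
```

With the values from the message, the request `32G` matches node11 through node14, which agrees with the job being scheduled on node12.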