Hi Sean,
Tried as suggested but still getting the same error. This is the node configuration visible to 'scontrol', just in case:

➜  scontrol show node
NodeName=nodeGPU01 Arch=x86_64 CoresPerSocket=16
   CPUAlloc=0 CPUTot=256 CPULoad=8.07
   AvailableFeatures=ht,gpu
   ActiveFeatures=ht,gpu
   Gres=gpu:A100:8
   NodeAddr=nodeGPU01 NodeHostName=nodeGPU01 Version=20.11.2
   OS=Linux 5.4.0-66-generic #74-Ubuntu SMP Wed Jan 27 22:54:38 UTC 2021
   RealMemory=1024000 AllocMem=0 FreeMem=1019774 Sockets=8 Boards=1
   State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=gpu,cpu
   BootTime=2021-04-09T21:23:14 SlurmdStartTime=2021-04-11T10:11:12
   CfgTRES=cpu=256,mem=1000G,billing=256
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Comment=(null)
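In case it helps narrow this down, here are two quick checks I could also run. My assumption (from the sinfo and slurmd man pages) is that they print the GRES string each daemon actually registered, so any mismatch with gres.conf should show up:

➜  sinfo -N -o "%N %P %G"    # GRES per node as slurmctld sees it
➜  slurmd -G                 # run on nodeGPU01 itself; prints the GRES slurmd parsed from gres.conf

Happy to paste that output here if useful.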
On Sun, Apr 11, 2021 at 2:03 AM Sean Crosby <scro...@unimelb.edu.au> wrote:

> Hi Cristobal,
>
> My hunch is it is due to the default memory/CPU settings.
>
> Does it work if you do
>
> srun --gres=gpu:A100:1 --cpus-per-task=1 --mem=10G nvidia-smi
>
> Sean
> --
> Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
> Research Computing Services | Business Services
> The University of Melbourne, Victoria 3010 Australia
>
>
> On Sun, 11 Apr 2021 at 15:26, Cristóbal Navarro <cristobal.navarr...@gmail.com> wrote:
>
>> Hi Community,
>> These last two days I've been trying to understand the cause of the
>> "Unable to allocate resources" error I keep getting when specifying
>> --gres=... in an srun command (or sbatch). It fails with this error:
>>
>> ➜  srun --gres=gpu:A100:1 nvidia-smi
>> srun: error: Unable to allocate resources: Requested node configuration is not available
>>
>> Log file on the master node (not the compute one):
>> ➜  tail -f /var/log/slurm/slurmctld.log
>> [2021-04-11T01:12:23.270] gres:gpu(7696487) type:(null)(0) job:1317 flags: state
>> [2021-04-11T01:12:23.270] gres_per_node:1 node_cnt:0
>> [2021-04-11T01:12:23.270] ntasks_per_gres:65534
>> [2021-04-11T01:12:23.270] select/cons_res: common_job_test: no job_resources info for JobId=1317 rc=-1
>> [2021-04-11T01:12:23.270] select/cons_res: common_job_test: no job_resources info for JobId=1317 rc=-1
>> [2021-04-11T01:12:23.270] select/cons_res: common_job_test: no job_resources info for JobId=1317 rc=-1
>> [2021-04-11T01:12:23.271] _pick_best_nodes: JobId=1317 never runnable in partition gpu
>> [2021-04-11T01:12:23.271] _slurm_rpc_allocate_resources: Requested node configuration is not available
>>
>> If launched without --gres, it allocates all GPUs by default and nvidia-smi does work;
>> in fact, our CUDA programs do work via SLURM if --gres is not specified.
>> ➜  TUT04-GPU-multi git:(master) ✗ srun nvidia-smi
>> Sun Apr 11 01:05:47 2021
>> +-----------------------------------------------------------------------------+
>> | NVIDIA-SMI 450.102.04   Driver Version: 450.102.04   CUDA Version: 11.0     |
>> |-------------------------------+----------------------+----------------------+
>> | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
>> | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
>> |                               |                      |               MIG M. |
>> |===============================+======================+======================|
>> |   0  A100-SXM4-40GB      On   | 00000000:07:00.0 Off |                    0 |
>> | N/A   31C    P0    51W / 400W |      0MiB / 40537MiB |      0%      Default |
>> |                               |                      |             Disabled |
>> ....
>> ....
>>
>> There is only one DGX A100 Compute node with 8 GPUs and 2x 64-core CPUs,
>> and the gres.conf file simply is (also tried the commented lines):
>> ➜  cat /etc/slurm/gres.conf
>> # GRES configuration for native GPUS
>> # DGX A100 8x Nvidia A100
>> #AutoDetect=nvml
>> Name=gpu Type=A100 File=/dev/nvidia[0-7]
>>
>> #Name=gpu Type=A100 File=/dev/nvidia0 Cores=0-7
>> #Name=gpu Type=A100 File=/dev/nvidia1 Cores=8-15
>> #Name=gpu Type=A100 File=/dev/nvidia2 Cores=16-23
>> #Name=gpu Type=A100 File=/dev/nvidia3 Cores=24-31
>> #Name=gpu Type=A100 File=/dev/nvidia4 Cores=32-39
>> #Name=gpu Type=A100 File=/dev/nvidia5 Cores=40-47
>> #Name=gpu Type=A100 File=/dev/nvidia6 Cores=48-55
>> #Name=gpu Type=A100 File=/dev/nvidia7 Cores=56-63
>>
>> Some relevant parts of the slurm.conf file:
>> ➜  cat /etc/slurm/slurm.conf
>> ...
>> ## GRES
>> GresTypes=gpu
>> AccountingStorageTRES=gres/gpu
>> DebugFlags=CPU_Bind,gres
>> ...
>> ## Nodes list
>> ## Default CPU layout, native GPUs
>> NodeName=nodeGPU01 SocketsPerBoard=8 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=1024000 State=UNKNOWN Gres=gpu:A100:8 Feature=ht,gpu
>> ...
>> ## Partitions list
>> PartitionName=gpu OverSubscribe=FORCE MaxCPUsPerNode=128 MaxTime=INFINITE State=UP Nodes=nodeGPU01 Default=YES
>> PartitionName=cpu OverSubscribe=FORCE MaxCPUsPerNode=128 MaxTime=INFINITE State=UP Nodes=nodeGPU01
>>
>> Any ideas where I should check?
>> thanks in advance
>> --
>> Cristóbal A. Navarro
>>
>

--
Cristóbal A. Navarro
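PS: one more thing I could try on my end; this is just an assumption of mine, not something confirmed to be the cause: re-enable the per-GPU Cores= lines in gres.conf, but take the ranges from the GPUs' real CPU affinity instead of guessing them, and then reload the config. Roughly:

➜  nvidia-smi topo -m        # the "CPU Affinity" column should give the core range per GPU
# then fill gres.conf accordingly -- the ranges below are placeholders, NOT the real DGX mapping:
# Name=gpu Type=A100 File=/dev/nvidia0 Cores=48-63
# Name=gpu Type=A100 File=/dev/nvidia1 Cores=48-63
# ...
➜  sudo systemctl restart slurmd     # on nodeGPU01, so slurmd re-reads gres.conf
➜  sudo scontrol reconfigure         # so slurmctld picks up the change as well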