Dear Slurm users,

Perhaps you can help me with a problem I am having with the scheduler (I am new to this, so please forgive any stupid mistakes/misunderstandings).
I am not able to submit a multi-threaded MPI job that uses all 10 CPUs on a small demo cluster I have set up with Azure CycleCloud, and I don’t understand why – perhaps you can explain why, and how I can fix this so the job uses all the available CPUs?

The hpc partition that I have set up consists of 5 nodes (Azure VM type = Standard_F2s_v2), each with 2 CPUs (I presume these are hyperthreaded cores rather than two physical CPUs, but I am not certain of this):

[azccadmin@ricslurm-hpc-pg0-1 ~]$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 106
model name      : Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
stepping        : 6
microcode       : 0xffffffff
cpu MHz         : 2793.436
cache size      : 49152 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 1
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 21
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology eagerfpu pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single tpr_shadow vnmi ept vpid fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm avx512f avx512dq rdseed adx smap clflushopt avx512cd avx512bw avx512vl xsaveopt xsavec md_clear
bogomips        : 5586.87
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 106
model name      : Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
stepping        : 6
microcode       : 0xffffffff
cpu MHz         : 2793.436
cache size      : 49152 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 1
apicid          : 1
initial apicid  : 1
fpu             : yes
fpu_exception   : yes
cpuid level     : 21
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology eagerfpu pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single tpr_shadow vnmi ept vpid fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm avx512f avx512dq rdseed adx smap clflushopt avx512cd avx512bw avx512vl xsaveopt xsavec md_clear
bogomips        : 5586.87
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

This is how Slurm sees one of the nodes:

[azccadmin@ricslurm-scheduler LID_CAVITY]$ scontrol show nodes
NodeName=ricslurm-hpc-pg0-1 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUEfctv=1 CPUTot=1 CPULoad=0.88
   AvailableFeatures=cloud
   ActiveFeatures=cloud
   Gres=(null)
   NodeAddr=ricslurm-hpc-pg0-1 NodeHostName=ricslurm-hpc-pg0-1 Version=22.05.3
   OS=Linux 3.10.0-1127.19.1.el7.x86_64 #1 SMP Tue Aug 25 17:23:54 UTC 2020
   RealMemory=3072 AllocMem=0 FreeMem=1854 Sockets=1 Boards=1
   State=IDLE+CLOUD ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=hpc
   BootTime=2022-12-12T17:42:27 SlurmdStartTime=2022-12-12T17:42:28
   LastBusyTime=2022-12-12T17:52:29
   CfgTRES=cpu=1,mem=3G,billing=1
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
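Looking at that output, I notice CPUTot=1 and CfgTRES=cpu=1 even though ThreadsPerCore=2, and I suspect that mismatch is the heart of my problem, although I am only guessing. For reference, this is roughly the kind of node definition I would have expected in slurm.conf for both hardware threads to be schedulable. This is purely my assumption from reading the slurm.conf man page, not what the CycleCloud template actually generates, so the values may well be wrong:

# Hypothetical slurm.conf entries (my guess, not the CycleCloud-generated config).
# CPUs=2 would, as I understand it, let Slurm schedule both hardware threads of
# the single core; RealMemory is copied from the scontrol output above.
NodeName=ricslurm-hpc-pg0-[1-5] CPUs=2 Sockets=1 CoresPerSocket=1 ThreadsPerCore=2 RealMemory=3072 State=CLOUD
PartitionName=hpc Nodes=ricslurm-hpc-pg0-[1-5] MaxTime=INFINITE State=UP

I believe I can also run "slurmd -C" on one of the compute nodes to see what Slurm auto-detects there, if that output would help.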
This is the Slurm job control script I have come up with to run the VECTIS job (I have set 5x nodes, 1x task per node, and 2x CPUs per task – is this right?):

#!/bin/bash
## Job name
#SBATCH --job-name=run-grma
#
## File to write standard output and error
#SBATCH --output=run-grma.out
#SBATCH --error=run-grma.err
#
## Partition for the cluster (you might not need that)
#SBATCH --partition=hpc
#
## Number of nodes
#SBATCH --nodes=5
#
## Number of tasks per node
#SBATCH --ntasks-per-node=1
#
## Number of CPUs per task
#SBATCH --cpus-per-task=2
#
## General
module purge

## Initialise VECTIS 2022.3b4
if [ -d /shared/apps/RealisSimulation/2022.3/bin ]
then
    export PATH=$PATH:/shared/apps/RealisSimulation/2022.3/bin
else
    echo "Failed to Initialise VECTIS"
fi

## Run
vpre -V 2022.3 -np $SLURM_NTASKS /shared/data/LID_CAVITY/files/lid.GRD
vsolve -V 2022.3 -np $SLURM_NTASKS -mpi intel_2018.4 -rdmu /shared/data/LID_CAVITY/files/lid_no_write.inp

But the submitted job will not run: Slurm says there are not enough CPUs. Here is the debug log from slurmctld, where you can see that it says the job has requested 10 CPUs (which is what I want), but that the hpc partition only has 5 (which I think is wrong?):

[2022-12-13T09:05:01.177] debug2: Processing RPC: REQUEST_NODE_INFO from UID=0
[2022-12-13T09:05:01.370] debug2: Processing RPC: REQUEST_SUBMIT_BATCH_JOB from UID=20001
[2022-12-13T09:05:01.371] debug3: _set_hostname: Using auth hostname for alloc_node: ricslurm-scheduler
[2022-12-13T09:05:01.371] debug3: JobDesc: user_id=20001 JobId=N/A partition=hpc name=run-grma
[2022-12-13T09:05:01.371] debug3: cpus=10-4294967294 pn_min_cpus=2 core_spec=-1
[2022-12-13T09:05:01.371] debug3: Nodes=5-[5] Sock/Node=65534 Core/Sock=65534 Thread/Core=65534
[2022-12-13T09:05:01.371] debug3: pn_min_memory_job=18446744073709551615 pn_min_tmp_disk=-1
[2022-12-13T09:05:01.371] debug3: immediate=0 reservation=(null)
[2022-12-13T09:05:01.371] debug3: features=(null) batch_features=(null) cluster_features=(null) prefer=(null)
[2022-12-13T09:05:01.371] debug3: req_nodes=(null) exc_nodes=(null)
[2022-12-13T09:05:01.371] debug3: time_limit=15-15 priority=-1 contiguous=0 shared=-1
[2022-12-13T09:05:01.371] debug3: kill_on_node_fail=-1 script=#!/bin/bash ## Job name #SBATCH --job-n...
[2022-12-13T09:05:01.371] debug3: argv="/shared/data/LID_CAVITY/slurm-runit.sh"
[2022-12-13T09:05:01.371] debug3: environment=XDG_SESSION_ID=12,HOSTNAME=ricslurm-scheduler,SELINUX_ROLE_REQUESTED=,...
[2022-12-13T09:05:01.371] debug3: stdin=/dev/null stdout=/shared/data/LID_CAVITY/run-grma.out stderr=/shared/data/LID_CAVITY/run-grma.err
[2022-12-13T09:05:01.372] debug3: work_dir=/shared/data/LID_CAVITY alloc_node:sid=ricslurm-scheduler:13464
[2022-12-13T09:05:01.372] debug3: power_flags=
[2022-12-13T09:05:01.372] debug3: resp_host=(null) alloc_resp_port=0 other_port=0
[2022-12-13T09:05:01.372] debug3: dependency=(null) account=(null) qos=(null) comment=(null)
[2022-12-13T09:05:01.372] debug3: mail_type=0 mail_user=(null) nice=0 num_tasks=5 open_mode=0 overcommit=-1 acctg_freq=(null)
[2022-12-13T09:05:01.372] debug3: network=(null) begin=Unknown cpus_per_task=2 requeue=-1 licenses=(null)
[2022-12-13T09:05:01.372] debug3: end_time= signal=0@0 wait_all_nodes=-1 cpu_freq=
[2022-12-13T09:05:01.372] debug3: ntasks_per_node=1 ntasks_per_socket=-1 ntasks_per_core=-1 ntasks_per_tres=-1
[2022-12-13T09:05:01.372] debug3: mem_bind=0:(null) plane_size:65534
[2022-12-13T09:05:01.372] debug3: array_inx=(null)
[2022-12-13T09:05:01.372] debug3: burst_buffer=(null)
[2022-12-13T09:05:01.372] debug3: mcs_label=(null)
[2022-12-13T09:05:01.372] debug3: deadline=Unknown
[2022-12-13T09:05:01.372] debug3: bitflags=0x1a00c000 delay_boot=4294967294
[2022-12-13T09:05:01.372] debug3: job_submit/lua: slurm_lua_loadscript: skipping loading Lua script: /etc/slurm/job_submit.lua
[2022-12-13T09:05:01.372] lua: Setting reqswitch to 1.
[2022-12-13T09:05:01.372] lua: returning.
[2022-12-13T09:05:01.372] debug2: _part_access_check: Job requested too many CPUs (10) of partition hpc(5)
[2022-12-13T09:05:01.373] debug2: _part_access_check: Job requested too many CPUs (10) of partition hpc(5)
[2022-12-13T09:05:01.373] debug2: JobId=1 can't run in partition hpc: More processors requested than permitted
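To understand where the "5" in "partition hpc(5)" comes from, my assumption (possibly wrong) is that these two commands show how many CPUs Slurm thinks the partition has in total:

# My assumption: check the CPU totals Slurm has configured for the hpc partition.
sinfo -p hpc -o "%P %D %C"     # %C prints CPUs as allocated/idle/other/total
scontrol show partition hpc    # should include TotalCPUs and TotalNodes

Based on the log above, it looks like that total is 5, i.e. one CPU per node, which would match the CPUTot=1 shown earlier.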
The job will run fine if I use the settings below (across 5 nodes, but only using one of the two CPUs on each node):

## Number of nodes
#SBATCH --nodes=5
#
## Number of tasks per node
#SBATCH --ntasks-per-node=1
#
## Number of CPUs per task
#SBATCH --cpus-per-task=1

Here are the details of the successfully submitted job, showing it using 5 CPUs (only one CPU per node) across the 5 nodes:

[azccadmin@ricslurm-scheduler LID_CAVITY]$ scontrol show job 3
JobId=3 JobName=run-grma
   UserId=azccadmin(20001) GroupId=azccadmin(20001) MCS_label=N/A
   Priority=4294901757 Nice=0 Account=(null) QOS=(null)
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:07:35 TimeLimit=00:15:00 TimeMin=N/A
   SubmitTime=2022-12-12T17:32:01 EligibleTime=2022-12-12T17:32:01
   AccrueTime=2022-12-12T17:32:01
   StartTime=2022-12-12T17:42:46 EndTime=2022-12-12T17:57:46 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-12-12T17:32:01 Scheduler=Main
   Partition=hpc AllocNode:Sid=ricslurm-scheduler:11723
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=ricslurm-hpc-pg0-[1-5]
   BatchHost=ricslurm-hpc-pg0-1
   NumNodes=5 NumCPUs=5 NumTasks=5 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=5,mem=15G,node=5,billing=5
   Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=3G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/shared/data/LID_CAVITY/slurm-runit.sh
   WorkDir=/shared/data/LID_CAVITY
   StdErr=/shared/data/LID_CAVITY/run-grma.err
   StdIn=/dev/null
   StdOut=/shared/data/LID_CAVITY/run-grma.out
   Switches=1@00:00:24
   Power=
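For completeness, these are the alternative request layouts I could try next, on the assumption that asking for two single-CPU tasks per node (or simply 10 tasks) might express the same thing in a way the scheduler accepts. I have not tested either yet, so they may be just as wrong:

## Variant A (untested): 2 single-CPU tasks on each of the 5 nodes
#SBATCH --nodes=5
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=1

## Variant B (untested): just ask for 10 tasks and let Slurm place them
#SBATCH --ntasks=10
#SBATCH --cpus-per-task=1

But I would still like to understand why the original request (5 nodes x 1 task x 2 CPUs per task) is rejected.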
What am I doing wrong here, and how do I get the job to run on both CPUs of all 5 nodes (i.e. fully utilising the available cluster resources of 10 CPUs)?

Regards,
Gary