Hi, thanks for getting back to me. I have been doing some more experimenting, and I think the issue is that the Azure VMs for my nodes are HyperThreaded.
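For anyone else checking the same thing: the /proc/cpuinfo output quoted further down already shows it (siblings = 2 against cpu cores = 1), but the quicker check I have been using is lscpu, which on the Standard_F2s_v2 sku reports 1 socket, 1 core per socket and 2 threads per core. The grep filter is just a convenience of mine, so adjust as needed:

lscpu | grep -E '^(CPU\(s\)|Thread|Core|Socket)'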
Slurm sees the cluster as 5 nodes with 1 CPU each and seems to ignore the HyperThreading, so it counts the cluster as 5 CPUs (not 10 as I thought), and it is therefore correct that it refuses a 10-CPU job. Speaking with my CFD types, they say our code should not be run on HT nodes anyway, so I have switched the nodes to a different Azure VM sku without HT, and the CPU count in Slurm now matches the CPU count of the VMs.

So - does Slurm actually ignore HT cores, as I am supposing? (The handful of commands I ended up comparing are at the very bottom of this mail, below the quoted thread.)

Regards

Gary

On Tue, 13 Dec 2022 at 15:52, Brian Andrus <toomuc...@gmail.com> wrote:

> Gary,
>
> Well your first issue is using Cyclecloud, but that is mostly opinion :)
>
> Your error states there aren't enough CPUs in the partition, which means
> we should take a look at the partition settings.
>
> Take a look at 'scontrol show partition hpc' and see how many nodes are
> assigned to it. Also check the state of the nodes with 'sinfo'.
>
> It would also be good to ensure the node settings are right. Run 'slurmd
> -C' on a node and see if the output matches what is in the config.
>
> Brian Andrus
>
> On 12/13/2022 1:38 AM, Gary Mansell wrote:
>
> Dear Slurm Users, perhaps you can help me with a problem that I am having
> using the Scheduler (I am new to this, so please forgive me for any stupid
> mistakes/misunderstandings).
>
> I am not able to submit a multi-threaded MPI job on a small demo cluster
> that I have set up using Azure CycleCloud in a way that uses all 10 CPUs in
> the cluster, and I don't understand why - perhaps you can explain why, and
> how I can fix this to use all of the available CPUs?
>
> The hpc partition that I have set up consists of 5 nodes (Azure VM type =
> Standard_F2s_v2), each with 2 CPUs (I presume that these are hyperthreaded
> cores rather than 2 separate CPUs, but I am not certain of this):
>
> [azccadmin@ricslurm-hpc-pg0-1 ~]$ cat /proc/cpuinfo
> processor : 0
> vendor_id : GenuineIntel
> cpu family : 6
> model : 106
> model name : Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
> stepping : 6
> microcode : 0xffffffff
> cpu MHz : 2793.436
> cache size : 49152 KB
> physical id : 0
> siblings : 2
> core id : 0
> cpu cores : 1
> apicid : 0
> initial apicid : 0
> fpu : yes
> fpu_exception : yes
> cpuid level : 21
> wp : yes
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology eagerfpu pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single tpr_shadow vnmi ept vpid fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm avx512f avx512dq rdseed adx smap clflushopt avx512cd avx512bw avx512vl xsaveopt xsavec md_clear
> bogomips : 5586.87
> clflush size : 64
> cache_alignment : 64
> address sizes : 46 bits physical, 48 bits virtual
> power management:
>
> processor : 1
> vendor_id : GenuineIntel
> cpu family : 6
> model : 106
> model name : Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
> stepping : 6
> microcode : 0xffffffff
> cpu MHz : 2793.436
> cache size : 49152 KB
> physical id : 0
> siblings : 2
> core id : 0
> cpu cores : 1
> apicid : 1
> initial apicid : 1
> fpu : yes
> fpu_exception : yes
> cpuid level : 21
> wp : yes
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology eagerfpu pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single tpr_shadow vnmi ept vpid fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm avx512f avx512dq rdseed adx smap clflushopt avx512cd avx512bw avx512vl xsaveopt xsavec md_clear
> bogomips : 5586.87
> clflush size : 64
> cache_alignment : 64
> address sizes : 46 bits physical, 48 bits virtual
> power management:
>
> This is how Slurm sees one of the nodes:
>
> [azccadmin@ricslurm-scheduler LID_CAVITY]$ scontrol show nodes
> NodeName=ricslurm-hpc-pg0-1 Arch=x86_64 CoresPerSocket=1
> CPUAlloc=0 CPUEfctv=1 CPUTot=1 CPULoad=0.88
> AvailableFeatures=cloud
> ActiveFeatures=cloud
> Gres=(null)
> NodeAddr=ricslurm-hpc-pg0-1 NodeHostName=ricslurm-hpc-pg0-1 Version=22.05.3
> OS=Linux 3.10.0-1127.19.1.el7.x86_64 #1 SMP Tue Aug 25 17:23:54 UTC 2020
> RealMemory=3072 AllocMem=0 FreeMem=1854 Sockets=1 Boards=1
> State=IDLE+CLOUD ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
> Partitions=hpc
> BootTime=2022-12-12T17:42:27 SlurmdStartTime=2022-12-12T17:42:28
> LastBusyTime=2022-12-12T17:52:29
> CfgTRES=cpu=1,mem=3G,billing=1
> AllocTRES=
> CapWatts=n/a
> CurrentWatts=0 AveWatts=0
> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>
> This is the Slurm Job Control Script I have come up with to run the Vectis
> Job (I have set 5x Node, 1x CPU, and 2x Threads – is this right?):
>
> #!/bin/bash
>
> ## Job name
> #SBATCH --job-name=run-grma
> #
> ## File to write standard output and error
> #SBATCH --output=run-grma.out
> #SBATCH --error=run-grma.err
> #
> ## Partition for the cluster (you might not need that)
> #SBATCH --partition=hpc
> #
> ## Number of nodes
> #SBATCH --nodes=5
> #
> ## Number of CPUs per nodes
> #SBATCH --ntasks-per-node=1
> #
> ## Number of CPUs per task
> #SBATCH --cpus-per-task=2
> #
>
> ## General
> module purge
>
> ## Initialise VECTIS 2022.3b4
> if [ -d /shared/apps/RealisSimulation/2022.3/bin ]
> then
>   export PATH=$PATH:/shared/apps/RealisSimulation/2022.3/bin
> else
>   echo "Failed to Initialise VECTIS"
> fi
>
> ## Run
>
> vpre -V 2022.3 -np $SLURM_NTASKS /shared/data/LID_CAVITY/files/lid.GRD
> vsolve -V 2022.3 -np $SLURM_NTASKS -mpi intel_2018.4 -rdmu /shared/data/LID_CAVITY/files/lid_no_write.inp
>
> But, the submitted job will not run as it says that there is not enough
> CPUs.
>
> Here is the debug log from slurmctld – where you can see that it is saying
> the job has requested 10 CPUs (which is what I want), but the hpc partition
> only has 5 (which I think is wrong?):
>
> [2022-12-13T09:05:01.177] debug2: Processing RPC: REQUEST_NODE_INFO from UID=0
> [2022-12-13T09:05:01.370] debug2: Processing RPC: REQUEST_SUBMIT_BATCH_JOB from UID=20001
> [2022-12-13T09:05:01.371] debug3: _set_hostname: Using auth hostname for alloc_node: ricslurm-scheduler
> [2022-12-13T09:05:01.371] debug3: JobDesc: user_id=20001 JobId=N/A partition=hpc name=run-grma
> [2022-12-13T09:05:01.371] debug3: cpus=10-4294967294 pn_min_cpus=2 core_spec=-1
> [2022-12-13T09:05:01.371] debug3: Nodes=5-[5] Sock/Node=65534 Core/Sock=65534 Thread/Core=65534
> [2022-12-13T09:05:01.371] debug3: pn_min_memory_job=18446744073709551615 pn_min_tmp_disk=-1
> [2022-12-13T09:05:01.371] debug3: immediate=0 reservation=(null)
> [2022-12-13T09:05:01.371] debug3: features=(null) batch_features=(null) cluster_features=(null) prefer=(null)
> [2022-12-13T09:05:01.371] debug3: req_nodes=(null) exc_nodes=(null)
> [2022-12-13T09:05:01.371] debug3: time_limit=15-15 priority=-1 contiguous=0 shared=-1
> [2022-12-13T09:05:01.371] debug3: kill_on_node_fail=-1 script=#!/bin/bash
>
> ## Job name
> #SBATCH --job-n...
> [2022-12-13T09:05:01.371] debug3: argv="/shared/data/LID_CAVITY/slurm-runit.sh"
> [2022-12-13T09:05:01.371] debug3: environment=XDG_SESSION_ID=12,HOSTNAME=ricslurm-scheduler,SELINUX_ROLE_REQUESTED=,...
> [2022-12-13T09:05:01.371] debug3: stdin=/dev/null stdout=/shared/data/LID_CAVITY/run-grma.out stderr=/shared/data/LID_CAVITY/run-grma.err
> [2022-12-13T09:05:01.372] debug3: work_dir=/shared/data/LID_CAVITY alloc_node:sid=ricslurm-scheduler:13464
> [2022-12-13T09:05:01.372] debug3: power_flags=
> [2022-12-13T09:05:01.372] debug3: resp_host=(null) alloc_resp_port=0 other_port=0
> [2022-12-13T09:05:01.372] debug3: dependency=(null) account=(null) qos=(null) comment=(null)
> [2022-12-13T09:05:01.372] debug3: mail_type=0 mail_user=(null) nice=0 num_tasks=5 open_mode=0 overcommit=-1 acctg_freq=(null)
> [2022-12-13T09:05:01.372] debug3: network=(null) begin=Unknown cpus_per_task=2 requeue=-1 licenses=(null)
> [2022-12-13T09:05:01.372] debug3: end_time= signal=0@0 wait_all_nodes=-1 cpu_freq=
> [2022-12-13T09:05:01.372] debug3: ntasks_per_node=1 ntasks_per_socket=-1 ntasks_per_core=-1 ntasks_per_tres=-1
> [2022-12-13T09:05:01.372] debug3: mem_bind=0:(null) plane_size:65534
> [2022-12-13T09:05:01.372] debug3: array_inx=(null)
> [2022-12-13T09:05:01.372] debug3: burst_buffer=(null)
> [2022-12-13T09:05:01.372] debug3: mcs_label=(null)
> [2022-12-13T09:05:01.372] debug3: deadline=Unknown
> [2022-12-13T09:05:01.372] debug3: bitflags=0x1a00c000 delay_boot=4294967294
> [2022-12-13T09:05:01.372] debug3: job_submit/lua: slurm_lua_loadscript: skipping loading Lua script: /etc/slurm/job_submit.lua
> [2022-12-13T09:05:01.372] lua: Setting reqswitch to 1.
> [2022-12-13T09:05:01.372] lua: returning.
> [2022-12-13T09:05:01.372] debug2: _part_access_check: Job requested too many CPUs (10) of partition hpc(5)
> [2022-12-13T09:05:01.373] debug2: _part_access_check: Job requested too many CPUs (10) of partition hpc(5)
> [2022-12-13T09:05:01.373] debug2: JobId=1 can't run in partition hpc: More processors requested than permitted
>
> The job will run fine if I use the below settings (across 5 nodes, but
> only using one of the two CPUs on each node):
>
> ## Number of nodes
> #SBATCH --nodes=5
> #
> ## Number of CPUs per nodes
> #SBATCH --ntasks-per-node=1
> #
> ## Number of CPUs per task
> #SBATCH --cpus-per-task=1
>
> Here is the successfully submitted Job details showing it using 5 CPU’s
> (only one CPU per node) across 5x Nodes:
>
> [azccadmin@ricslurm-scheduler LID_CAVITY]$ scontrol show job 3
> JobId=3 JobName=run-grma
> UserId=azccadmin(20001) GroupId=azccadmin(20001) MCS_label=N/A
> Priority=4294901757 Nice=0 Account=(null) QOS=(null)
> JobState=RUNNING Reason=None Dependency=(null)
> Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
> RunTime=00:07:35 TimeLimit=00:15:00 TimeMin=N/A
> SubmitTime=2022-12-12T17:32:01 EligibleTime=2022-12-12T17:32:01
> AccrueTime=2022-12-12T17:32:01
> StartTime=2022-12-12T17:42:46 EndTime=2022-12-12T17:57:46 Deadline=N/A
> SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-12-12T17:32:01 Scheduler=Main
> Partition=hpc AllocNode:Sid=ricslurm-scheduler:11723
> ReqNodeList=(null) ExcNodeList=(null)
> NodeList=ricslurm-hpc-pg0-[1-5]
> BatchHost=ricslurm-hpc-pg0-1
> NumNodes=5 NumCPUs=5 NumTasks=5 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
> TRES=cpu=5,mem=15G,node=5,billing=5
> Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*
> MinCPUsNode=1 MinMemoryCPU=3G MinTmpDiskNode=0
> Features=(null) DelayBoot=00:00:00
> OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
> Command=/shared/data/LID_CAVITY/slurm-runit.sh
> WorkDir=/shared/data/LID_CAVITY
> StdErr=/shared/data/LID_CAVITY/run-grma.err
> StdIn=/dev/null
> StdOut=/shared/data/LID_CAVITY/run-grma.out
> Switches=1@00:00:24
> Power=
>
> What am I doing wrong here - how do I get it to run the job on both CPU’s
> on all 5 nodes (i.e. fully utilising the available cluster resources of 10x
> CPUs)?
>
> Regards
>
> Gary
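PS. For anyone else who hits this with CycleCloud and an HT sku: my (possibly wrong) understanding is that Slurm does not ignore HyperThreading as such - it counts whatever the node definition in slurm.conf and the SelectTypeParameters setting tell it to count, and on my cluster that came out as one CPU (one physical core) per node. These are the checks I ended up comparing on the HT nodes; the commands themselves are standard Slurm ones, but treat the grep filters as rough conveniences of mine:

slurmd -C                                    # what the node hardware actually reports (sockets/cores/threads)
scontrol show config | grep -i selecttype    # how the scheduler is configured to allocate (e.g. CR_Core vs CR_CPU)
scontrol show node ricslurm-hpc-pg0-1 | grep -E 'CPUTot|ThreadsPerCore'    # what Slurm has registered for the node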