I’ve only got 2 GPUs in my nodes, but I’ve always used non-overlapping CPUs= or COREs= settings. Currently, they’re:

NodeName=gpunode00[1-4] Name=gpu Type=k80 File=/dev/nvidia[0-1] COREs=0-7,9-15

and I’ve got 2 jobs currently running on each node that’s available. So maybe:

NodeName=c0005 Name=gpu File=/dev/nvidia[0-3] CPUs=0-10,11-21,22-32,33-43

would work?
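One caveat with the single-line form, as far as I know: when File= spans a device range, the whole CPUs= list on that line applies to every GPU in the range, so /dev/nvidia[0-3] would all share cores 0-43 rather than each getting a distinct quarter. A per-device sketch of what I think you mean (the quarter split is just your proposed ranges; whether it lines up with the actual NUMA affinity shown in the nvidia-smi output quoted below is a separate question):

NodeName=c0005 Name=gpu File=/dev/nvidia0 CPUs=0-10
NodeName=c0005 Name=gpu File=/dev/nvidia1 CPUs=11-21
NodeName=c0005 Name=gpu File=/dev/nvidia2 CPUs=22-32
NodeName=c0005 Name=gpu File=/dev/nvidia3 CPUs=33-43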
> On Aug 7, 2020, at 12:40 PM, Jodie H. Sprouse <jh...@cornell.edu> wrote:
>
> Hi Tina,
> Thank you so much for looking at this.
> slurm 18.08.8
>
> nvidia-smi topo -m
>          GPU0    GPU1    GPU2    GPU3    mlx5_0  CPU Affinity
> GPU0      X      NV2     NV2     NV2     NODE    0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18,20-20,22-22,24-24,26-26,28-28,30-30,32-32,34-34,36-36,38-38,40-40,42-42
> GPU1     NV2      X      NV2     NV2     NODE    0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18,20-20,22-22,24-24,26-26,28-28,30-30,32-32,34-34,36-36,38-38,40-40,42-42
> GPU2     NV2     NV2      X      NV2     SYS     1-1,3-3,5-5,7-7,9-9,11-11,13-13,15-15,17-17,19-19,21-21,23-23,25-25,27-27,29-29,31-31,33-33,35-35,37-37,39-39,41-41,43-43
> GPU3     NV2     NV2     NV2      X      SYS     1-1,3-3,5-5,7-7,9-9,11-11,13-13,15-15,17-17,19-19,21-21,23-23,25-25,27-27,29-29,31-31,33-33,35-35,37-37,39-39,41-41,43-43
> mlx5_0   NODE    NODE    SYS     SYS      X
>
> I have tried in the gres.conf (without success; only 2 gpu jobs run per node;
> no cpu jobs are currently running):
> NodeName=c0005 Name=gpu File=/dev/nvidia0 CPUs=[0,2,4,6,8,10]
> NodeName=c0005 Name=gpu File=/dev/nvidia1 CPUs=[0,2,4,6,8,10]
> NodeName=c0005 Name=gpu File=/dev/nvidia2 CPUs=[1,3,5,7,11,13,15,17,29]
> NodeName=c0005 Name=gpu File=/dev/nvidia3 CPUs=[1,3,5,7,11,13,15,17,29]
>
> I also tried your suggestions of 0-13, 14-27, and a combo.
> I still only get 2 jobs to run on gpus at a time. If I take off the “CPUs=”,
> I do get 4 jobs running per node.
>
> Jodie
>
>
> On Aug 7, 2020, at 12:18 PM, Tina Friedrich <tina.friedr...@it.ox.ac.uk> wrote:
>
> Hi Jodie,
>
> what version of SLURM are you using? I'm pretty sure newer versions pick the
> topology up automatically (although I'm on 18.08 so I can't verify that).
>
> Is what you're wanting to do - basically - forcefully feed a 'wrong'
> gres.conf to make SLURM assume all GPUs are on one CPU? (I don't think I've
> ever tried that!)
>
> I have no idea, unfortunately, what CPU SLURM assigns first - it will not (I
> don't think) assign cores on the non-GPU CPU first (other people please
> correct me if I'm wrong!).
>
> My gres.conf files get written by my config management from the GPU topology;
> I don't think I've ever written one of them manually, and I've never tried to
> deliberately give one a wrong affinity.
>
> The GRES conf would probably need to look something like
>
> Name=gpu Type=tesla File=/dev/nvidia0 CPUs=0-13
> Name=gpu Type=tesla File=/dev/nvidia1 CPUs=0-13
> Name=gpu Type=tesla File=/dev/nvidia2 CPUs=0-13
> Name=gpu Type=tesla File=/dev/nvidia3 CPUs=0-13
>
> or maybe
>
> Name=gpu Type=tesla File=/dev/nvidia0 CPUs=14-27
> Name=gpu Type=tesla File=/dev/nvidia1 CPUs=14-27
> Name=gpu Type=tesla File=/dev/nvidia2 CPUs=14-27
> Name=gpu Type=tesla File=/dev/nvidia3 CPUs=14-27
>
> to 'assign' all GPUs to the first 14 CPUs or second 14 CPUs (your config
> makes me think there are two 14-core CPUs, so cores 0-13 would probably be
> CPU1, etc.?)
>
> (What is the actual topology of the system, according to, say, 'nvidia-smi
> topo -m'?)
>
> Tina
>
> On 07/08/2020 16:31, Jodie H. Sprouse wrote:
>> Tina,
>> Thank you.
>> Yes, jobs will run on all 4 gpus if I submit with:
>> --gres-flags=disable-binding
>> Yet my goal is to have the gpus bind to a cpu in order to allow a cpu-only
>> job to never run on that particular cpu (having it bound to the gpu and
>> always free for a gpu job) and give the cpu job the max cpus minus the 4.
>>
>> * Hyperthreading is turned on.
>> NodeName=c000[1-5] Gres=gpu:tesla:4 Boards=1 SocketsPerBoard=2
>> CoresPerSocket=14 ThreadsPerCore=2 RealMemory=190000
>>
>> PartitionName=gpu Nodes=c000[1-5] Default=NO DefaultTime=1:00:00
>> MaxTime=168:00:00 State=UP OverSubscribe=NO
>> TRESBillingWeights="CPU=.25,Mem=0.25G,gres/gpu=2.0"
>> PartitionName=cpu Nodes=c000[1-5] Default=NO DefaultTime=1:00:00
>> MaxTime=168:00:00 State=UP OverSubscribe=NO
>> TRESBillingWeights="CPU=.25,Mem=0.25G" MaxCPUsPerNode=48
>>
>> I have tried variations for gres.conf such as:
>> NodeName=c0005 Name=gpu File=/dev/nvidia[0-1] CPUs=0,2
>> NodeName=c0005 Name=gpu File=/dev/nvidia[2-3] CPUs=1,3
>>
>> as well as trying CORES= (rather than CPUs=), with no success.
>>
>> I’ve battled this all week. Any suggestions would be greatly appreciated!
>> Jodie
>>
>>
>> On Aug 7, 2020, at 11:12 AM, Tina Friedrich <tina.friedr...@it.ox.ac.uk> wrote:
>>
>> Hello,
>>
>> This is something I've seen once on our systems & it took me a while to
>> figure out what was going on.
>>
>> The solution was that the system topology was such that all GPUs were
>> connected to one CPU. There were no free cores on that particular CPU, so
>> SLURM did not schedule any more jobs to the GPUs. I needed to disable
>> binding at job submission to schedule to all of them.
>>
>> Not sure that applies in your situation (don't know your system), but it's
>> something to check?
>>
>> Tina
>>
>>
>> On 07/08/2020 15:42, Jodie H. Sprouse wrote:
>>> Good morning.
>>> I am having the same experience here. Wondering if you had a resolution?
>>> Thank you.
>>> Jodie
>>>
>>>
>>> On Jun 11, 2020, at 3:27 PM, Rhian Resnick <rresn...@fau.edu> wrote:
>>>
>>> We have several users submitting single-GPU jobs to our cluster. We
>>> expected the jobs to fill each node and fully utilize the available GPUs,
>>> but we instead find that only 2 out of the 4 GPUs in each node get
>>> allocated.
>>>
>>> If we request 2 GPUs per job and start two jobs, both jobs will start
>>> on the same node, fully allocating the node. We are puzzled about what is
>>> going on and any hints are welcome.
>>>
>>> Thanks for your help,
>>>
>>> Rhian
>>>
>>>
>>>
>>> *Example SBATCH Script*
>>> #!/bin/bash
>>> #SBATCH --job-name=test
>>> #SBATCH --partition=longq7-mri
>>> #SBATCH -N 1
>>> #SBATCH -n 1
>>> #SBATCH --gres=gpu:1
>>> #SBATCH --mail-type=ALL
>>> hostname
>>> echo CUDA_VISIBLE_DEVICES $CUDA_VISIBLE_DEVICES
>>>
>>> set | grep SLURM
>>> nvidia-smi
>>> sleep 500
>>>
>>>
>>>
>>>
>>> *gres.conf*
>>> #AutoDetect=nvml
>>> Name=gpu Type=v100 File=/dev/nvidia0 Cores=0
>>> Name=gpu Type=v100 File=/dev/nvidia1 Cores=1
>>> Name=gpu Type=v100 File=/dev/nvidia2 Cores=2
>>> Name=gpu Type=v100 File=/dev/nvidia3 Cores=3
>>>
>>>
>>> *slurm.conf*
>>> #
>>> # Example slurm.conf file. Please run configurator.html
>>> # (in doc/html) to build a configuration file customized
>>> # for your environment.
>>> #
>>> #
>>> # slurm.conf file generated by configurator.html.
>>> #
>>> # See the slurm.conf man page for more information.
>>> #
>>> ClusterName=cluster
>>> ControlMachine=cluster-slurm1.example.com
>>> ControlAddr=10.116.0.11
>>> BackupController=cluster-slurm2.example.com
>>> BackupAddr=10.116.0.17
>>> #
>>> SlurmUser=slurm
>>> #SlurmdUser=root
>>> SlurmctldPort=6817
>>> SlurmdPort=6818
>>> SchedulerPort=7321
>>>
>>> RebootProgram="/usr/sbin/reboot"
>>>
>>>
>>> AuthType=auth/munge
>>> #JobCredentialPrivateKey=
>>> #JobCredentialPublicCertificate=
>>> StateSaveLocation=/var/spool/slurm/ctld
>>> SlurmdSpoolDir=/var/spool/slurm/d
>>> SwitchType=switch/none
>>> MpiDefault=none
>>> SlurmctldPidFile=/var/run/slurmctld.pid
>>> SlurmdPidFile=/var/run/slurmd.pid
>>> ProctrackType=proctrack/pgid
>>>
>>> GresTypes=gpu,mps,bandwidth
>>>
>>> PrologFlags=x11
>>> #PluginDir=
>>> #FirstJobId=
>>> #MaxJobCount=
>>> #PlugStackConfig=
>>> #PropagatePrioProcess=
>>> #PropagateResourceLimits=
>>> #PropagateResourceLimitsExcept=
>>> #Prolog=
>>> #Epilog=/etc/slurm/slurm.epilog.clean
>>> #SrunProlog=
>>> #SrunEpilog=
>>> #TaskProlog=
>>> #TaskEpilog=
>>> #TaskPlugin=
>>> #TrackWCKey=no
>>> #TreeWidth=50
>>> #TmpFS=
>>> #UsePAM=
>>> #
>>> # TIMERS
>>> SlurmctldTimeout=300
>>> SlurmdTimeout=300
>>> InactiveLimit=0
>>> MinJobAge=300
>>> KillWait=30
>>> Waittime=0
>>> #
>>> # SCHEDULING
>>> SchedulerType=sched/backfill
>>> #bf_interval=10
>>> #SchedulerAuth=
>>> #SelectType=select/linear
>>> # Cores and memory are consumable
>>> #SelectType=select/cons_res
>>> #SelectTypeParameters=CR_Core_Memory
>>> SchedulerParameters=bf_interval=10
>>> SelectType=select/cons_res
>>> SelectTypeParameters=CR_Core
>>>
>>> FastSchedule=1
>>> #PriorityType=priority/multifactor
>>> #PriorityDecayHalfLife=14-0
>>> #PriorityUsageResetPeriod=14-0
>>> #PriorityWeightFairshare=100000
>>> #PriorityWeightAge=1000
>>> #PriorityWeightPartition=10000
>>> #PriorityWeightJobSize=1000
>>> #PriorityMaxAge=1-0
>>> #
>>> # LOGGING
>>> SlurmctldDebug=3
>>> SlurmctldLogFile=/var/log/slurmctld.log
>>> SlurmdDebug=3
>>> SlurmdLogFile=/var/log/slurmd.log
>>> JobCompType=jobcomp/none
>>> #JobCompLoc=
>>> #
>>> # ACCOUNTING
>>> #JobAcctGatherType=jobacct_gather/linux
>>> #JobAcctGatherFrequency=30
>>> #
>>> #AccountingStorageType=accounting_storage/slurmdbd
>>> #AccountingStorageHost=
>>> #AccountingStorageLoc=
>>> #AccountingStoragePass=
>>> #AccountingStorageUser=
>>> #
>>> #
>>> #
>>> # Default values
>>> # DefMemPerNode=64000
>>> # DefCpuPerGPU=4
>>> # DefMemPerCPU=4000
>>> # DefMemPerGPU=16000
>>>
>>>
>>>
>>> # OpenHPC default configuration
>>> #TaskPlugin=task/affinity
>>> TaskPlugin=task/affinity,task/cgroup
>>> PropagateResourceLimitsExcept=MEMLOCK
>>> TaskPluginParam=autobind=cores
>>> #AccountingStorageType=accounting_storage/mysql
>>> #StorageLoc=slurm_acct_db
>>>
>>> AccountingStorageType=accounting_storage/slurmdbd
>>> AccountingStorageHost=cluster-slurmdbd1.example.com
>>> #AccountingStorageType=accounting_storage/filetxt
>>> Epilog=/etc/slurm/slurm.epilog.clean
>>>
>>>
>>> #PartitionName=normal Nodes=c[1-5] Default=YES MaxTime=24:00:00 State=UP
>>> PartitionName=DEFAULT State=UP Default=NO AllowGroups=ALL Priority=10
>>> DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0
>>> PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO
>>> ExclusiveUser=NO Nodes=nodeamd[009-016],c[1-4],nodehtc[001-025]
>>>
>>>
>>> # Partitions
>>>
>>> # Group Limited Queues
>>>
>>> # OIT DEBUG QUEUE
>>> PartitionName=debug Nodes=c[1-4] MaxTime=24:00:00 State=UP
>>> AllowGroups=oit-hpc-admin
>>>
>>> # RNA CHEM
>>> PartitionName=longq7-rna MinNodes=1 MaxNodes=30 DefaultTime=168:00:00
>>> MaxTime=UNLIMITED Priority=200 Nodes=nodeamd[001-008],nodegpu[021-025]
>>> AllowGroups=gpu-rnachem
>>>
>>> # V100's
>>> PartitionName=longq7-mri MinNodes=1 MaxNodes=30 DefaultTime=168:00:00
>>> MaxTime=168:00:00 Priority=200 Nodes=nodenviv100[001-016]
>>> AllowGroups=gpu-mri
>>>
>>> # BIGDATA GRANT
>>> PartitionName=longq-bigdata7 MinNodes=1 MaxNodes=30 DefaultTime=168:00:00
>>> MaxTime=168:00:00 Priority=200 Nodes=node[087-098],nodegpu001
>>> AllowGroups=fau-bigdata,nsf-bigdata
>>>
>>> PartitionName=gpu-bigdata7 Default=NO MinNodes=1 Priority=10
>>> AllowAccounts=ALL Nodes=nodegpu001 AllowGroups=fau-bigdata,nsf-bigdata
>>>
>>> # CogNeuroLab
>>> PartitionName=CogNeuroLab Default=NO MinNodes=1 MaxNodes=4
>>> MaxTime=7-12:00:00 AllowGroups=cogneurolab Priority=200 State=UP
>>> Nodes=node[001-004]
>>>
>>>
>>> # Standard queues
>>>
>>> # OPEN TO ALL
>>>
>>> # Short Queue
>>> PartitionName=shortq7 MinNodes=1 MaxNodes=30 DefaultTime=06:00:00
>>> MaxTime=06:00:00 Priority=100
>>> Nodes=nodeamd[001-016],nodenviv100[001-015],nodegpu[001-025],node[001-100],nodehtc[001-025]
>>> Default=YES
>>>
>>> # Medium Queue
>>> PartitionName=mediumq7 MinNodes=1 MaxNodes=30 DefaultTime=72:00:00
>>> MaxTime=72:00:00 Priority=50 Nodes=nodeamd[009-016],node[004-100]
>>>
>>> # Long Queue
>>> PartitionName=longq7 MinNodes=1 MaxNodes=30 DefaultTime=168:00:00
>>> MaxTime=168:00:00 Priority=30 Nodes=nodeamd[009-016],node[004-100]
>>>
>>>
>>> # Interactive
>>> PartitionName=interactive MinNodes=1 MaxNodes=4 DefaultTime=06:00:00
>>> MaxTime=06:00:00 Priority=101 Nodes=node[001-100] Default=No Hidden=YES
>>>
>>> # Nodes
>>>
>>> # Test nodes (VMs)
>>> NodeName=c[1-4] Cpus=4 Feature=virtual RealMemory=16000
>>>
>>> # AMD Nodes
>>> NodeName=nodeamd[001-016] Procs=64 Boards=1 SocketsPerBoard=8
>>> CoresPerSocket=8 ThreadsPerCore=1 Features=amd,epyc RealMemory=225436
>>>
>>> # V100 MRI
>>> NodeName=nodenviv100[001-016] CPUs=64 Boards=1 SocketsPerBoard=2
>>> CoresPerSocket=16 ThreadsPerCore=2 Gres=gpu:v100:4 Feature=v100
>>> RealMemory=192006
>>>
>>> # GPU nodes
>>> NodeName=nodegpu001 Procs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10
>>> ThreadsPerCore=2 Gres=gpu:k80:8 Feature=k80,intel RealMemory=64000
>>> NodeName=nodegpu002 Procs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10
>>> ThreadsPerCore=2 Gres=gpu:gk1:8 Feature=gk1,intel RealMemory=128000
>>> NodeName=nodegpu[003-020] Boards=1 SocketsPerBoard=2 CoresPerSocket=8
>>> ThreadsPerCore=2 Gres=gpu:gk1:8 Feature=gk1,intel RealMemory=128000
>>> NodeName=nodegpu[021-025] Procs=16 Boards=1 SocketsPerBoard=2
>>> CoresPerSocket=8 ThreadsPerCore=1 Gres=gpu:4 Feature=exxact,intel
>>> RealMemory=128000
>>>
>>> # IvyBridge nodes
>>> NodeName=node[001-021] Procs=20 Boards=1 SocketsPerBoard=2
>>> CoresPerSocket=10 ThreadsPerCore=1 Feature=intel,ivybridge RealMemory=112750
>>> # SandyBridge node(2)
>>> NodeName=node022 Procs=16 Boards=1 SocketsPerBoard=2 CoresPerSocket=8
>>> ThreadsPerCore=1 Feature=intel,sandybridge RealMemory=64000
>>> # IvyBridge
>>> NodeName=node[023-050] Procs=20 Boards=1 SocketsPerBoard=2
>>> CoresPerSocket=10 ThreadsPerCore=1 Feature=intel,ivybridge RealMemory=112750
>>> # Haswell
>>> NodeName=node[051-100] Procs=20 Boards=1 SocketsPerBoard=2
>>> CoresPerSocket=10 ThreadsPerCore=1 Feature=intel,haswell RealMemory=112750
>>>
>>>
>>> # Node health monitoring
>>> HealthCheckProgram=/usr/sbin/nhc
>>> HealthCheckInterval=300
>>> ReturnToService=2
>>>
>>> # Fix for X11 issues
>>> X11Parameters=use_raw_hostname
>>>
>>>
>>>
>>> Rhian Resnick
>>> Associate Director Research Computing
>>> Enterprise Systems
>>> Office of Information Technology
>>>
>>> Florida Atlantic University
>>> 777 Glades Road, CM22, Rm 173B
>>> Boca Raton, FL 33431
>>> Phone 561.297.2647
>>> Fax 561.297.0222
>>>
>>
>
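For anyone testing changes to gres.conf along these lines: a quick way to reproduce the behaviour discussed above is to queue four single-GPU jobs at once and see how many start on one node, then repeat the same test with --gres-flags=disable-binding (as Jodie did) to check whether the CPU/core binding is what is holding the other GPUs back. A minimal sketch, based on Rhian's test script; the partition name "gpu" and the sleep length are placeholders for whatever fits your site:

#!/bin/bash
# Submit four single-GPU test jobs; each prints the node it ran on and the GPU it was given.
for i in 1 2 3 4; do
    sbatch --partition=gpu --gres=gpu:1 -N 1 -n 1 \
        --wrap 'hostname; echo CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES; sleep 300'
done

# Check how the jobs were placed and which GRES each one holds.
squeue -u $USER -o "%.10i %.9P %.8T %.15R %b"

If only two of the four start while the node still shows idle cores, re-running the loop with --gres-flags=disable-binding added to the sbatch line is a useful control for isolating the binding as the cause.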