Hello Tina,

Thank you for the suggestions and responses! As of right now, it seems to be working after taking "CPUs=" off altogether in gres.conf. The original idea was to set aside 4 CPUs that would always go to the GPUs; I'm not so sure that is necessary as long as the cpu partition can never grab more than 48. I have set MaxCPUsPerNode=48 for the cpu partition and MaxCPUsPerNode=8 for the gpu partition. More users will be getting on in the upcoming weeks; I will keep watch. Now onward to making sure TRESBillingWeights="CPU=.25,Mem=0.25G,gres/gpu=1.0" is set correctly and that we do not see jobs starved out.

Thank you again!
Jodie
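For reference, a minimal sketch of how the relevant lines could end up after these changes; the node names, GPU type, and partition settings are taken from the configs quoted further down in this thread, so treat it as an illustration rather than a verified config.

gres.conf (no CPUs= binding, so GPU jobs are no longer pinned to cores on one socket):

NodeName=c000[1-5] Name=gpu Type=tesla File=/dev/nvidia[0-3]

slurm.conf partitions (the cpu partition capped at 48 CPUs per node, the gpu partition at 8):

PartitionName=gpu Nodes=c000[1-5] Default=NO DefaultTime=1:00:00 MaxTime=168:00:00 State=UP OverSubscribe=NO MaxCPUsPerNode=8 TRESBillingWeights="CPU=.25,Mem=0.25G,gres/gpu=1.0"
PartitionName=cpu Nodes=c000[1-5] Default=NO DefaultTime=1:00:00 MaxTime=168:00:00 State=UP OverSubscribe=NO MaxCPUsPerNode=48 TRESBillingWeights="CPU=.25,Mem=0.25G"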
On Aug 10, 2020, at 10:31 AM, Tina Friedrich <tina.friedr...@it.ox.ac.uk> wrote:

Hello,

yes, that would probably work; or simply taking the "CPUs=" off, really. However, I think what Jodie's trying to do is force all GPU jobs onto one of the CPUs, rather than allowing GPU jobs to spread over all processors regardless of affinity.

Jodie - can you try whether

NodeName=c0005 Name=gpu File=/dev/nvidia[0-3] CPUs=0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18,20-20,22-22,24-24,26-26,28-28,30-30,32-32,34-34,36-36,38-38,40-40,42-42

gets you there?

Tina

On 07/08/2020 19:46, Renfro, Michael wrote:
> I've only got 2 GPUs in my nodes, but I've always used non-overlapping CPUs= or COREs= settings. Currently, they're:
>
> NodeName=gpunode00[1-4] Name=gpu Type=k80 File=/dev/nvidia[0-1] COREs=0-7,9-15
>
> and I've got 2 jobs currently running on each node that's available.
>
> So maybe:
>
> NodeName=c0005 Name=gpu File=/dev/nvidia[0-3] CPUs=0-10,11-21,22-32,33-43
>
> would work?
>
>> On Aug 7, 2020, at 12:40 PM, Jodie H. Sprouse <jh...@cornell.edu> wrote:
>>
>> Hi Tina,
>> Thank you so much for looking at this.
>> slurm 18.08.8
>>
>> nvidia-smi topo -m
>>         GPU0   GPU1   GPU2   GPU3   mlx5_0   CPU Affinity
>> GPU0    X      NV2    NV2    NV2    NODE     0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18,20-20,22-22,24-24,26-26,28-28,30-30,32-32,34-34,36-36,38-38,40-40,42-42
>> GPU1    NV2    X      NV2    NV2    NODE     0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18,20-20,22-22,24-24,26-26,28-28,30-30,32-32,34-34,36-36,38-38,40-40,42-42
>> GPU2    NV2    NV2    X      NV2    SYS      1-1,3-3,5-5,7-7,9-9,11-11,13-13,15-15,17-17,19-19,21-21,23-23,25-25,27-27,29-29,31-31,33-33,35-35,37-37,39-39,41-41,43-43
>> GPU3    NV2    NV2    NV2    X      SYS      1-1,3-3,5-5,7-7,9-9,11-11,13-13,15-15,17-17,19-19,21-21,23-23,25-25,27-27,29-29,31-31,33-33,35-35,37-37,39-39,41-41,43-43
>> mlx5_0  NODE   NODE   SYS    SYS    X
>>
>> I have tried in gres.conf (without success; only 2 gpu jobs run per node; no cpu jobs are currently running):
>> NodeName=c0005 Name=gpu File=/dev/nvidia0 CPUs=[0,2,4,6,8,10]
>> NodeName=c0005 Name=gpu File=/dev/nvidia1 CPUs=[0,2,4,6,8,10]
>> NodeName=c0005 Name=gpu File=/dev/nvidia2 CPUs=[1,3,5,7,11,13,15,17,29]
>> NodeName=c0005 Name=gpu File=/dev/nvidia3 CPUs=[1,3,5,7,11,13,15,17,29]
>>
>> I also tried your suggestions of 0-13, 14-27, and a combination.
>> I still only get 2 jobs to run on GPUs at a time. If I take off the "CPUs=", I do get 4 jobs running per node.
>>
>> Jodie
>>
>> On Aug 7, 2020, at 12:18 PM, Tina Friedrich <tina.friedr...@it.ox.ac.uk> wrote:
>>
>> Hi Jodie,
>>
>> what version of SLURM are you using? I'm pretty sure newer versions pick the topology up automatically (although I'm on 18.08 so I can't verify that).
>>
>> Is what you're wanting to do - basically - forcefully feed a 'wrong' gres.conf to make SLURM assume all GPUs are on one CPU? (I don't think I've ever tried that!)
>>
>> I have no idea, unfortunately, what CPU SLURM assigns first - it will not (I don't think) assign cores on the non-GPU CPU first (other people please correct me if I'm wrong!).
>>
>> My gres.conf files get written by my config management from the GPU topology; I don't think I've ever written one of them manually. And I've never tried to make them anything wrong, i.e. I've never tried to deliberately give a wrong one.
>>
>> The GRES conf would probably need to look something like
>>
>> Name=gpu Type=tesla File=/dev/nvidia0 CPUs=0-13
>> Name=gpu Type=tesla File=/dev/nvidia1 CPUs=0-13
>> Name=gpu Type=tesla File=/dev/nvidia2 CPUs=0-13
>> Name=gpu Type=tesla File=/dev/nvidia3 CPUs=0-13
>>
>> or maybe
>>
>> Name=gpu Type=tesla File=/dev/nvidia0 CPUs=14-27
>> Name=gpu Type=tesla File=/dev/nvidia1 CPUs=14-27
>> Name=gpu Type=tesla File=/dev/nvidia2 CPUs=14-27
>> Name=gpu Type=tesla File=/dev/nvidia3 CPUs=14-27
>>
>> to 'assign' all GPUs to the first 14 CPUs or the second 14 CPUs (your config makes me think there are two 14-core CPUs, so cores 0-13 would probably be CPU1, etc.?)
>>
>> (What is the actual topology of the system, according to, say, 'nvidia-smi topo -m'?)
>>
>> Tina
>>
>> On 07/08/2020 16:31, Jodie H. Sprouse wrote:
>>> Tina,
>>> Thank you. Yes, jobs will run on all 4 GPUs if I submit with:
>>> --gres-flags=disable-binding
>>> Yet my goal is to have the GPUs bind to a CPU, so that a cpu-only job can never run on that particular CPU (it stays bound to the GPU and always free for a GPU job) and the cpu partition gets the maximum CPUs minus those 4.
>>>
>>> * Hyperthreading is turned on.
>>> NodeName=c000[1-5] Gres=gpu:tesla:4 Boards=1 SocketsPerBoard=2 CoresPerSocket=14 ThreadsPerCore=2 RealMemory=190000
>>>
>>> PartitionName=gpu Nodes=c000[1-5] Default=NO DefaultTime=1:00:00 MaxTime=168:00:00 State=UP OverSubscribe=NO TRESBillingWeights="CPU=.25,Mem=0.25G,gres/gpu=2.0"
>>> PartitionName=cpu Nodes=c000[1-5] Default=NO DefaultTime=1:00:00 MaxTime=168:00:00 State=UP OverSubscribe=NO TRESBillingWeights="CPU=.25,Mem=0.25G" MaxCPUsPerNode=48
>>>
>>> I have tried variations for gres.conf such as:
>>> NodeName=c0005 Name=gpu File=/dev/nvidia[0-1] CPUs=0,2
>>> NodeName=c0005 Name=gpu File=/dev/nvidia[2-3] CPUs=1,3
>>>
>>> as well as trying CORES= (rather than CPUs=), with no success.
>>>
>>> I've battled this all week. Any suggestions would be greatly appreciated!
>>> Jodie
>>>
>>> On Aug 7, 2020, at 11:12 AM, Tina Friedrich <tina.friedr...@it.ox.ac.uk> wrote:
>>>
>>> Hello,
>>>
>>> This is something I've seen once on our systems & it took me a while to figure out what was going on.
>>>
>>> The solution was that the system topology was such that all GPUs were connected to one CPU. There were no free cores on that particular CPU, so SLURM did not schedule any more jobs to the GPUs. I needed to disable binding in the job submission to schedule to all of them.
>>>
>>> Not sure that applies in your situation (I don't know your system), but it's something to check?
>>>
>>> Tina
>>>
>>> On 07/08/2020 15:42, Jodie H. Sprouse wrote:
>>>> Good morning.
>>>> I am having the same experience here. Wondering if you had a resolution?
>>>> Thank you.
>>>> Jodie
>>>>
>>>> On Jun 11, 2020, at 3:27 PM, Rhian Resnick <rresn...@fau.edu> wrote:
>>>>
>>>> We have several users submitting single-GPU jobs to our cluster. We expected the jobs to fill each node and fully utilize the available GPUs, but we instead find that only 2 out of the 4 GPUs in each node get allocated.
>>>>
>>>> If we request 2 GPUs in the job and start two jobs, both jobs will start on the same node, fully allocating the node. We are puzzled about what is going on and any hints are welcome.
>>>>
>>>> Thanks for your help,
>>>>
>>>> Rhian
>>>>
>>>>
>>>> *Example SBATCH Script*
>>>> #!/bin/bash
>>>> #SBATCH --job-name=test
>>>> #SBATCH --partition=longq7-mri
>>>> #SBATCH -N 1
>>>> #SBATCH -n 1
>>>> #SBATCH --gres=gpu:1
>>>> #SBATCH --mail-type=ALL
>>>> hostname
>>>> echo CUDA_VISIBLE_DEVICES $CUDA_VISIBLE_DEVICES
>>>>
>>>> set | grep SLURM
>>>> nvidia-smi
>>>> sleep 500
>>>>
>>>>
>>>> *gres.conf*
>>>> #AutoDetect=nvml
>>>> Name=gpu Type=v100 File=/dev/nvidia0 Cores=0
>>>> Name=gpu Type=v100 File=/dev/nvidia1 Cores=1
>>>> Name=gpu Type=v100 File=/dev/nvidia2 Cores=2
>>>> Name=gpu Type=v100 File=/dev/nvidia3 Cores=3
>>>>
>>>>
>>>> *slurm.conf*
>>>> #
>>>> # Example slurm.conf file. Please run configurator.html
>>>> # (in doc/html) to build a configuration file customized
>>>> # for your environment.
>>>> #
>>>> # slurm.conf file generated by configurator.html.
>>>> #
>>>> # See the slurm.conf man page for more information.
>>>> #
>>>> ClusterName=cluster
>>>> ControlMachine=cluster-slurm1.example.com
>>>> ControlAddr=10.116.0.11
>>>> BackupController=cluster-slurm2.example.com
>>>> BackupAddr=10.116.0.17
>>>> #
>>>> SlurmUser=slurm
>>>> #SlurmdUser=root
>>>> SlurmctldPort=6817
>>>> SlurmdPort=6818
>>>> SchedulerPort=7321
>>>>
>>>> RebootProgram="/usr/sbin/reboot"
>>>>
>>>> AuthType=auth/munge
>>>> #JobCredentialPrivateKey=
>>>> #JobCredentialPublicCertificate=
>>>> StateSaveLocation=/var/spool/slurm/ctld
>>>> SlurmdSpoolDir=/var/spool/slurm/d
>>>> SwitchType=switch/none
>>>> MpiDefault=none
>>>> SlurmctldPidFile=/var/run/slurmctld.pid
>>>> SlurmdPidFile=/var/run/slurmd.pid
>>>> ProctrackType=proctrack/pgid
>>>>
>>>> GresTypes=gpu,mps,bandwidth
>>>>
>>>> PrologFlags=x11
>>>> #PluginDir=
>>>> #FirstJobId=
>>>> #MaxJobCount=
>>>> #PlugStackConfig=
>>>> #PropagatePrioProcess=
>>>> #PropagateResourceLimits=
>>>> #PropagateResourceLimitsExcept=
>>>> #Prolog=
>>>> #Epilog=/etc/slurm/slurm.epilog.clean
>>>> #SrunProlog=
>>>> #SrunEpilog=
>>>> #TaskProlog=
>>>> #TaskEpilog=
>>>> #TaskPlugin=
>>>> #TrackWCKey=no
>>>> #TreeWidth=50
>>>> #TmpFS=
>>>> #UsePAM=
>>>> #
>>>> # TIMERS
>>>> SlurmctldTimeout=300
>>>> SlurmdTimeout=300
>>>> InactiveLimit=0
>>>> MinJobAge=300
>>>> KillWait=30
>>>> Waittime=0
>>>> #
>>>> # SCHEDULING
>>>> SchedulerType=sched/backfill
>>>> #bf_interval=10
>>>> #SchedulerAuth=
>>>> #SelectType=select/linear
>>>> # Cores and memory are consumable
>>>> #SelectType=select/cons_res
>>>> #SelectTypeParameters=CR_Core_Memory
>>>> SchedulerParameters=bf_interval=10
>>>> SelectType=select/cons_res
>>>> SelectTypeParameters=CR_Core
>>>>
>>>> FastSchedule=1
>>>> #PriorityType=priority/multifactor
>>>> #PriorityDecayHalfLife=14-0
>>>> #PriorityUsageResetPeriod=14-0
>>>> #PriorityWeightFairshare=100000
>>>> #PriorityWeightAge=1000
>>>> #PriorityWeightPartition=10000
>>>> #PriorityWeightJobSize=1000
>>>> #PriorityMaxAge=1-0
>>>> #
>>>> # LOGGING
>>>> SlurmctldDebug=3
>>>> SlurmctldLogFile=/var/log/slurmctld.log
>>>> SlurmdDebug=3
>>>> SlurmdLogFile=/var/log/slurmd.log
>>>> JobCompType=jobcomp/none
>>>> #JobCompLoc=
>>>> #
>>>> # ACCOUNTING
>>>> #JobAcctGatherType=jobacct_gather/linux
>>>> #JobAcctGatherFrequency=30
>>>> #
>>>> #AccountingStorageType=accounting_storage/slurmdbd
>>>> #AccountingStorageHost=
>>>> #AccountingStorageLoc=
>>>> #AccountingStoragePass=
>>>> #AccountingStorageUser=
>>>> #
>>>> # Default values
>>>> # DefMemPerNode=64000
>>>> # DefCpuPerGPU=4
>>>> # DefMemPerCPU=4000
>>>> # DefMemPerGPU=16000
>>>>
>>>> # OpenHPC default configuration
>>>> #TaskPlugin=task/affinity
>>>> TaskPlugin=task/affinity,task/cgroup
>>>> PropagateResourceLimitsExcept=MEMLOCK
>>>> TaskPluginParam=autobind=cores
>>>> #AccountingStorageType=accounting_storage/mysql
>>>> #StorageLoc=slurm_acct_db
>>>>
>>>> AccountingStorageType=accounting_storage/slurmdbd
>>>> AccountingStorageHost=cluster-slurmdbd1.example.com
>>>> #AccountingStorageType=accounting_storage/filetxt
>>>> Epilog=/etc/slurm/slurm.epilog.clean
>>>>
>>>> #PartitionName=normal Nodes=c[1-5] Default=YES MaxTime=24:00:00 State=UP
>>>> PartitionName=DEFAULT State=UP Default=NO AllowGroups=ALL Priority=10 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO ExclusiveUser=NO Nodes=nodeamd[009-016],c[1-4],nodehtc[001-025]
>>>>
>>>> # Partitions
>>>>
>>>> # Group Limited Queues
>>>>
>>>> # OIT DEBUG QUEUE
>>>> PartitionName=debug Nodes=c[1-4] MaxTime=24:00:00 State=UP AllowGroups=oit-hpc-admin
>>>>
>>>> # RNA CHEM
>>>> PartitionName=longq7-rna MinNodes=1 MaxNodes=30 DefaultTime=168:00:00 MaxTime=UNLIMITED Priority=200 Nodes=nodeamd[001-008],nodegpu[021-025] AllowGroups=gpu-rnachem
>>>>
>>>> # V100's
>>>> PartitionName=longq7-mri MinNodes=1 MaxNodes=30 DefaultTime=168:00:00 MaxTime=168:00:00 Priority=200 Nodes=nodenviv100[001-016] AllowGroups=gpu-mri
>>>>
>>>> # BIGDATA GRANT
>>>> PartitionName=longq-bigdata7 MinNodes=1 MaxNodes=30 DefaultTime=168:00:00 MaxTime=168:00:00 Priority=200 Nodes=node[087-098],nodegpu001 AllowGroups=fau-bigdata,nsf-bigdata
>>>>
>>>> PartitionName=gpu-bigdata7 Default=NO MinNodes=1 Priority=10 AllowAccounts=ALL Nodes=nodegpu001 AllowGroups=fau-bigdata,nsf-bigdata
>>>>
>>>> # CogNeuroLab
>>>> PartitionName=CogNeuroLab Default=NO MinNodes=1 MaxNodes=4 MaxTime=7-12:00:00 AllowGroups=cogneurolab Priority=200 State=UP Nodes=node[001-004]
>>>>
>>>> # Standard queues
>>>>
>>>> # OPEN TO ALL
>>>>
>>>> # Short Queue
>>>> PartitionName=shortq7 MinNodes=1 MaxNodes=30 DefaultTime=06:00:00 MaxTime=06:00:00 Priority=100 Nodes=nodeamd[001-016],nodenviv100[001-015],nodegpu[001-025],node[001-100],nodehtc[001-025] Default=YES
>>>>
>>>> # Medium Queue
>>>> PartitionName=mediumq7 MinNodes=1 MaxNodes=30 DefaultTime=72:00:00 MaxTime=72:00:00 Priority=50 Nodes=nodeamd[009-016],node[004-100]
>>>>
>>>> # Long Queue
>>>> PartitionName=longq7 MinNodes=1 MaxNodes=30 DefaultTime=168:00:00 MaxTime=168:00:00 Priority=30 Nodes=nodeamd[009-016],node[004-100]
>>>>
>>>> # Interactive
>>>> PartitionName=interactive MinNodes=1 MaxNodes=4 DefaultTime=06:00:00 MaxTime=06:00:00 Priority=101 Nodes=node[001-100] Default=No Hidden=YES
>>>>
>>>> # Nodes
>>>>
>>>> # Test nodes (VMs)
>>>> NodeName=c[1-4] Cpus=4 Feature=virtual RealMemory=16000
>>>>
>>>> # AMD Nodes
>>>> NodeName=nodeamd[001-016] Procs=64 Boards=1 SocketsPerBoard=8 CoresPerSocket=8 ThreadsPerCore=1 Features=amd,epyc RealMemory=225436
>>>>
>>>> # V100 MRI
>>>> NodeName=nodenviv100[001-016] CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 Gres=gpu:v100:4 Feature=v100 RealMemory=192006
>>>>
>>>> # GPU nodes
>>>> NodeName=nodegpu001 Procs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=2 Gres=gpu:k80:8 Feature=k80,intel RealMemory=64000
>>>> NodeName=nodegpu002 Procs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=2 Gres=gpu:gk1:8 Feature=gk1,intel RealMemory=128000
>>>> NodeName=nodegpu[003-020] Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=2 Gres=gpu:gk1:8 Feature=gk1,intel RealMemory=128000
>>>> NodeName=nodegpu[021-025] Procs=16 Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=1 Gres=gpu:4 Feature=exxact,intel RealMemory=128000
>>>>
>>>> # IvyBridge nodes
>>>> NodeName=node[001-021] Procs=20 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=1 Feature=intel,ivybridge RealMemory=112750
>>>> # SandyBridge node(2)
>>>> NodeName=node022 Procs=16 Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=1 Feature=intel,sandybridge RealMemory=64000
>>>> # IvyBridge
>>>> NodeName=node[023-050] Procs=20 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=1 Feature=intel,ivybridge RealMemory=112750
>>>> # Haswell
>>>> NodeName=node[051-100] Procs=20 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=1 Feature=intel,haswell RealMemory=112750
>>>>
>>>> # Node health monitoring
>>>> HealthCheckProgram=/usr/sbin/nhc
>>>> HealthCheckInterval=300
>>>> ReturnToService=2
>>>>
>>>> # Fix for X11 issues
>>>> X11Parameters=use_raw_hostname
>>>>
>>>> Rhian Resnick
>>>> Associate Director Research Computing
>>>> Enterprise Systems
>>>> Office of Information Technology
>>>>
>>>> Florida Atlantic University
>>>> 777 Glades Road, CM22, Rm 173B
>>>> Boca Raton, FL 33431
>>>> Phone 561.297.2647
>>>> Fax 561.297.0222