Hello,

yes, that would probably work; or simply taking the "CPUs=" off, really.

However, I think what Jodie's trying to do is force all GPU jobs onto one of the CPUs, rather than allowing GPU jobs to spread over all processors regardless of affinity.

Jodie - can you try if

NodeName=c0005 Name=gpu File=/dev/nvidia[0-3] CPUs=0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18,20-20,22-22,24-24,26-26,28-28,30-30,32-32,34-34,36-36,38-38,40-40,42-42

gets you there?
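
(It might also be worth double-checking how slurmd itself sees that node, since I think the CPU indices in gres.conf need to match Slurm's numbering rather than nvidia-smi's. Something like

slurmd -C

run on c0005 should print the CPUs/Sockets/CoresPerSocket/ThreadsPerCore counts as Slurm sees them - just a suggestion, I haven't needed to do this myself.)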

Tina

On 07/08/2020 19:46, Renfro, Michael wrote:
I’ve only got 2 GPUs in my nodes, but I’ve always used non-overlapping CPUs= or 
COREs= settings. Currently, they’re:

   NodeName=gpunode00[1-4] Name=gpu Type=k80 File=/dev/nvidia[0-1] COREs=0-7,9-15

and I’ve got 2 jobs currently running on each node that’s available.

So maybe:

   NodeName=c0005 Name=gpu File=/dev/nvidia[0-3] CPUs=0-10,11-21,22-32,33-43

would work?
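
Or, spelled out one device per line in case the bracketed form doesn't split the ranges the way you'd expect (just a sketch, I haven't tested this on your hardware):

   NodeName=c0005 Name=gpu File=/dev/nvidia0 CPUs=0-10
   NodeName=c0005 Name=gpu File=/dev/nvidia1 CPUs=11-21
   NodeName=c0005 Name=gpu File=/dev/nvidia2 CPUs=22-32
   NodeName=c0005 Name=gpu File=/dev/nvidia3 CPUs=33-43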

On Aug 7, 2020, at 12:40 PM, Jodie H. Sprouse <jh...@cornell.edu> wrote:


Hi Tina,
Thank you so much for looking at this.
slurm 18.08.8

nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    mlx5_0  CPU Affinity
GPU0     X      NV2     NV2     NV2     NODE    0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18,20-20,22-22,24-24,26-26,28-28,30-30,32-32,34-34,36-36,38-38,40-40,42-42
GPU1    NV2      X      NV2     NV2     NODE    0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18,20-20,22-22,24-24,26-26,28-28,30-30,32-32,34-34,36-36,38-38,40-40,42-42
GPU2    NV2     NV2      X      NV2     SYS     1-1,3-3,5-5,7-7,9-9,11-11,13-13,15-15,17-17,19-19,21-21,23-23,25-25,27-27,29-29,31-31,33-33,35-35,37-37,39-39,41-41,43-43
GPU3    NV2     NV2     NV2      X      SYS     1-1,3-3,5-5,7-7,9-9,11-11,13-13,15-15,17-17,19-19,21-21,23-23,25-25,27-27,29-29,31-31,33-33,35-35,37-37,39-39,41-41,43-43
mlx5_0  NODE    NODE    SYS     SYS      X

I have tried the following in gres.conf, without success (only 2 GPU jobs run per node; no CPU jobs are currently running):
NodeName=c0005 Name=gpu File=/dev/nvidia0 CPUs=[0,2,4,6,8,10]
NodeName=c0005 Name=gpu File=/dev/nvidia1 CPUs=[0,2,4,6,8,10]
NodeName=c0005 Name=gpu File=/dev/nvidia2 CPUs=[1,3,5,7,11,13,15,17,29]
NodeName=c0005 Name=gpu File=/dev/nvidia3 CPUs=[1,3,5,7,11,13,15,17,29]

I also tried your suggestions of 0-13, 14-27, and a combo. I still only get 2 jobs to run on GPUs at a time. If I take off the "CPUs=", I do get 4 jobs running per node.
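
Is there anything else worth checking? I can run, for example:

scontrol show node c0005 | grep -i -E 'gres|cpu'
grep -i gres /var/log/slurmd.log | tail

(assuming /var/log/slurmd.log is where SlurmdLogFile points here) and send the output if that would help.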

Jodie


On Aug 7, 2020, at 12:18 PM, Tina Friedrich <tina.friedr...@it.ox.ac.uk> wrote:

Hi Jodie,

what version of SLURM are you using? I'm pretty sure newer versions pick the 
topology up automatically (although I'm on 18.08 so I can't verify that).

Is what you're wanting to do - basically - forcefully feed a 'wrong' gres.conf 
to make SLURM assume all GPUs are on one CPU? (I don't think I've ever tried 
that!).

I have no idea, unfortunately, what CPU SLURM assigns first - it will not (I 
don't think) assign cores on the non-GPU CPU first (other people please correct 
me if I'm wrong!).

My gres.conf files get written by my config management from the GPU topology; I don't think I've ever written one of them manually. And I've never tried to make them say anything wrong, i.e. I've never tried to deliberately give a GPU a CPU affinity it doesn't actually have.

The GRES conf would probably need to look something like

Name=gpu Type=tesla File=/dev/nvidia0 CPUs=0-13
Name=gpu Type=tesla File=/dev/nvidia1 CPUs=0-13
Name=gpu Type=tesla File=/dev/nvidia2 CPUs=0-13
Name=gpu Type=tesla File=/dev/nvidia3 CPUs=0-13

or maybe

Name=gpu Type=tesla File=/dev/nvidia0 CPUs=14-27
Name=gpu Type=tesla File=/dev/nvidia1 CPUs=14-27
Name=gpu Type=tesla File=/dev/nvidia2 CPUs=14-27
Name=gpu Type=tesla File=/dev/nvidia3 CPUs=14-27

to 'assign' all GPUs to the first 14 CPUs or the second 14 CPUs (your config makes me think there are two 14-core CPUs, so cores 0-13 would probably be CPU 1, etc.?)
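
(One way to double-check which core IDs sit on which socket, if you want to confirm that guess, would be something like

lscpu -e=CPU,CORE,SOCKET

on the node, which should list the socket for each logical CPU.)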

(What is the actual topology of the system, according to, say, 'nvidia-smi topo -m'?)

Tina

On 07/08/2020 16:31, Jodie H. Sprouse wrote:
Tina,
Thank you. Yes, jobs will run on all 4 GPUs if I submit with --gres-flags=disable-binding.
Yet my goal is to have each GPU bound to a CPU so that a CPU-only job never runs on that particular CPU (keeping it bound to the GPU and always free for a GPU job), and to give CPU jobs the max CPUs minus those 4.

* Hyperthreading is turned on.
NodeName=c000[1-5] Gres=gpu:tesla:4 Boards=1 SocketsPerBoard=2 CoresPerSocket=14 ThreadsPerCore=2 RealMemory=190000

PartitionName=gpu Nodes=c000[1-5] Default=NO DefaultTime=1:00:00 MaxTime=168:00:00 State=UP OverSubscribe=NO TRESBillingWeights="CPU=.25,Mem=0.25G,gres/gpu=2.0"
PartitionName=cpu Nodes=c000[1-5] Default=NO DefaultTime=1:00:00 MaxTime=168:00:00 State=UP OverSubscribe=NO TRESBillingWeights="CPU=.25,Mem=0.25G" MaxCPUsPerNode=48
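
(The idea being: 2 sockets x 14 cores x 2 threads = 56 logical CPUs per node, so MaxCPUsPerNode=48 on the cpu partition should leave 8 logical CPUs - 4 physical cores with hyperthreading - free for GPU jobs, one core per GPU, if I've set that up correctly.)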

I have tried variations in gres.conf such as:
NodeName=c0005 Name=gpu File=/dev/nvidia[0-1] CPUs=0,2
NodeName=c0005 Name=gpu File=/dev/nvidia[2-3] CPUs=1,3

as well as trying CORES= (rather than CPUs=), with no success.


I've battled this all week. Any suggestions would be greatly appreciated!
Jodie


On Aug 7, 2020, at 11:12 AM, Tina Friedrich <tina.friedr...@it.ox.ac.uk> wrote:

Hello,

This is something I've seen once on our systems & it took me a while to figure 
out what was going on.

The cause was that the system topology was such that all GPUs were connected to one CPU. There were no free cores on that particular CPU, so SLURM did not schedule any more jobs to the GPUs. We needed to disable binding at job submission to schedule to all of them.
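
(In our case that just meant adding the flag at submission time, i.e. something along the lines of

sbatch --gres-flags=disable-binding ...

if I remember correctly.)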

Not sure that applies in your situation (don't know your system), but it's 
something to check?

Tina


On 07/08/2020 15:42, Jodie H. Sprouse wrote:
Good morning.
I am having the same experience here. Wondering if you found a resolution?
Thank you.
Jodie


On Jun 11, 2020, at 3:27 PM, Rhian Resnick <rresn...@fau.edu> wrote:

We have several users submitting single-GPU jobs to our cluster. We expected the jobs to fill each node and fully utilize the available GPUs, but we instead find that only 2 out of the 4 GPUs in each node get allocated.

If we request 2 GPUs per job and start two jobs, both jobs will start on the same node, fully allocating the node. We are puzzled about what is going on and any hints are welcome.

Thanks for your help,

Rhian



*Example SBATCH Script*
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --partition=longq7-mri
#SBATCH -N 1
#SBATCH -n 1
#SBATCH --gres=gpu:1
#SBATCH --mail-type=ALL
hostname
echo CUDA_VISIBLE_DEVICES $CUDA_VISIBLE_DEVICES

set | grep SLURM
nvidia-smi
sleep 500




*gres.conf*
#AutoDetect=nvml
Name=gpu Type=v100  File=/dev/nvidia0 Cores=0
Name=gpu Type=v100  File=/dev/nvidia1 Cores=1
Name=gpu Type=v100  File=/dev/nvidia2 Cores=2
Name=gpu Type=v100  File=/dev/nvidia3 Cores=3


*slurm.conf*
#
# Example slurm.conf file. Please run configurator.html
# (in doc/html) to build a configuration file customized
# for your environment.
#
#
# slurm.conf file generated by configurator.html.
#
# See the slurm.conf man page for more information.
#
ClusterName=cluster
ControlMachine=cluster-slurm1.example.com
ControlAddr=10.116.0.11
BackupController=cluster-slurm2.example.com
BackupAddr=10.116.0.17
#
SlurmUser=slurm
#SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
SchedulerPort=7321

RebootProgram="/usr/sbin/reboot"


AuthType=auth/munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
StateSaveLocation=/var/spool/slurm/ctld
SlurmdSpoolDir=/var/spool/slurm/d
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/pgid

GresTypes=gpu,mps,bandwidth

PrologFlags=x11
#PluginDir=
#FirstJobId=
#MaxJobCount=
#PlugStackConfig=
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#Prolog=
#Epilog=/etc/slurm/slurm.epilog.clean
#SrunProlog=
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
#TaskPlugin=
#TrackWCKey=no
#TreeWidth=50
#TmpFS=
#UsePAM=
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
#bf_interval=10
#SchedulerAuth=
#SelectType=select/linear
# Cores and memory are consumable
#SelectType=select/cons_res
#SelectTypeParameters=CR_Core_Memory
SchedulerParameters=bf_interval=10
SelectType=select/cons_res
SelectTypeParameters=CR_Core

FastSchedule=1
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityUsageResetPeriod=14-0
#PriorityWeightFairshare=100000
#PriorityWeightAge=1000
#PriorityWeightPartition=10000
#PriorityWeightJobSize=1000
#PriorityMaxAge=1-0
#
# LOGGING
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd.log
JobCompType=jobcomp/none
#JobCompLoc=
#
# ACCOUNTING
#JobAcctGatherType=jobacct_gather/linux
#JobAcctGatherFrequency=30
#
#AccountingStorageType=accounting_storage/slurmdbd
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStorageUser=
#
#
#
# Default values
# DefMemPerNode=64000
# DefCpuPerGPU=4
# DefMemPerCPU=4000
# DefMemPerGPU=16000



# OpenHPC default configuration
#TaskPlugin=task/affinity
TaskPlugin=task/affinity,task/cgroup
PropagateResourceLimitsExcept=MEMLOCK
TaskPluginParam=autobind=cores
#AccountingStorageType=accounting_storage/mysql
#StorageLoc=slurm_acct_db

AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=cluster-slurmdbd1.example.com
#AccountingStorageType=accounting_storage/filetxt
Epilog=/etc/slurm/slurm.epilog.clean


#PartitionName=normal Nodes=c[1-5] Default=YES MaxTime=24:00:00 State=UP
PartitionName=DEFAULT State=UP Default=NO AllowGroups=ALL Priority=10 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO ExclusiveUser=NO Nodes=nodeamd[009-016],c[1-4],nodehtc[001-025]


# Partitions

# Group Limited Queues

# OIT DEBUG QUEUE
PartitionName=debug Nodes=c[1-4] MaxTime=24:00:00 State=UP AllowGroups=oit-hpc-admin

# RNA CHEM
PartitionName=longq7-rna MinNodes=1 MaxNodes=30 DefaultTime=168:00:00 MaxTime=UNLIMITED Priority=200 Nodes=nodeamd[001-008],nodegpu[021-025] AllowGroups=gpu-rnachem

# V100's
PartitionName=longq7-mri MinNodes=1 MaxNodes=30 DefaultTime=168:00:00 MaxTime=168:00:00 Priority=200 Nodes=nodenviv100[001-016] AllowGroups=gpu-mri

# BIGDATA GRANT
PartitionName=longq-bigdata7 MinNodes=1 MaxNodes=30 DefaultTime=168:00:00 MaxTime=168:00:00 Priority=200 Nodes=node[087-098],nodegpu001 AllowGroups=fau-bigdata,nsf-bigdata

PartitionName=gpu-bigdata7 Default=NO MinNodes=1 Priority=10 AllowAccounts=ALL Nodes=nodegpu001 AllowGroups=fau-bigdata,nsf-bigdata

# CogNeuroLab
PartitionName=CogNeuroLab Default=NO MinNodes=1 MaxNodes=4 MaxTime=7-12:00:00 AllowGroups=cogneurolab Priority=200 State=UP Nodes=node[001-004]


# Standard queues

# OPEN TO ALL

#Short Queue
PartitionName=shortq7 MinNodes=1 MaxNodes=30 DefaultTime=06:00:00 MaxTime=06:00:00 Priority=100 Nodes=nodeamd[001-016],nodenviv100[001-015],nodegpu[001-025],node[001-100],nodehtc[001-025] Default=YES

# Medium Queue
PartitionName=mediumq7 MinNodes=1 MaxNodes=30 DefaultTime=72:00:00 MaxTime=72:00:00 Priority=50 Nodes=nodeamd[009-016],node[004-100]

# Long Queue
PartitionName=longq7 MinNodes=1 MaxNodes=30 DefaultTime=168:00:00 MaxTime=168:00:00 Priority=30 Nodes=nodeamd[009-016],node[004-100]


# Interactive
PartitionName=interactive MinNodes=1 MaxNodes=4 DefaultTime=06:00:00 MaxTime=06:00:00 Priority=101 Nodes=node[001-100] Default=No Hidden=YES

# Nodes

# Test nodes, (vms)
NodeName=c[1-4] Cpus=4 Feature=virtual RealMemory=16000

# AMD Nodes
NodeName=nodeamd[001-016] Procs=64 Boards=1 SocketsPerBoard=8 CoresPerSocket=8 ThreadsPerCore=1 Features=amd,epyc RealMemory=225436

# V100 MRI
NodeName=nodenviv100[001-016] CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 Gres=gpu:v100:4 Feature=v100 RealMemory=192006

# GPU nodes
NodeName=nodegpu001 Procs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=2 Gres=gpu:k80:8 Feature=k80,intel RealMemory=64000
NodeName=nodegpu002 Procs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=2 Gres=gpu:gk1:8 Feature=gk1,intel RealMemory=128000
NodeName=nodegpu[003-020] Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=2 Gres=gpu:gk1:8 Feature=gk1,intel RealMemory=128000
NodeName=nodegpu[021-025] Procs=16 Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=1 Gres=gpu:4 Feature=exxact,intel RealMemory=128000

# IvyBridge nodes
NodeName=node[001-021] Procs=20 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=1 Feature=intel,ivybridge RealMemory=112750
# SandyBridge node(2)
NodeName=node022 Procs=16 Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=1 Feature=intel,sandybridge RealMemory=64000
# IvyBridge
NodeName=node[023-050] Procs=20 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=1 Feature=intel,ivybridge RealMemory=112750
# Haswell
NodeName=node[051-100] Procs=20 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=1 Feature=intel,haswell RealMemory=112750


# Node health monitoring
HealthCheckProgram=/usr/sbin/nhc
HealthCheckInterval=300
ReturnToService=2

# Fix for X11 issues
X11Parameters=use_raw_hostname



Rhian Resnick
Associate Director Research Computing
Enterprise Systems
Office of Information Technology

Florida Atlantic University
777 Glades Road, CM22, Rm 173B
Boca Raton, FL 33431
Phone 561.297.2647
Fax 561.297.0222


