Hi Kevin,

I did a "scontrol show partition". OverSubscribe was not enabled. I
enabled it in slurm.conf with:

(...)
GresTypes=gpu
NodeName=asimov Gres=gpu:4 Sockets=1 CoresPerSocket=32 ThreadsPerCore=2 State=UNKNOWN
PartitionName=asimov01 *OverSubscribe=FORCE* Nodes=asimov Default=YES MaxTime=INFINITE MaxNodes=1 DefCpuPerGPU=2 State=UP
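To pick up the change I reloaded the configuration. This is only a
sketch of what I ran (assuming the stock systemd unit names on this
single node):

  scontrol reconfigure                      # re-read slurm.conf
  # or, if that is not enough:
  sudo systemctl restart slurmctld slurmd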
Now it is working, but only for CPU jobs. It does not preempt GPU jobs.

Launching 3 CPU-only jobs, each requiring 32 out of the 64 cores, the
jobs are preempted after the timeslice as expected:

  sbatch --cpus-per-task=32 test-cpu.sh

  JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
    352  asimov01 cpu-only  hdaniel  R   0:58      1 asimov
    353  asimov01 cpu-only  hdaniel  R   0:25      1 asimov
    351  asimov01 cpu-only  hdaniel  S   0:36      1 asimov

But launching 3 GPU jobs, each requiring 2 out of the 4 GPUs, it does
not preempt the first 2, which keep running. The 3rd job is left
pending on Resources:

  JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
    356  asimov01      gpu  hdaniel PD   0:00      1 (Resources)
    354  asimov01      gpu  hdaniel  R   3:05      1 asimov
    355  asimov01      gpu  hdaniel  R   3:02      1 asimov

Do I need to change anything else in the configuration to also support
gang scheduling of GPU jobs?

Thanks
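PS: for reference, the test jobs are minimal batch scripts along these
lines (the script bodies below are a sketch of what I use, not
verbatim):

  # test-cpu.sh (submitted with: sbatch --cpus-per-task=32 test-cpu.sh)
  #!/bin/bash
  #SBATCH --job-name=cpu-only
  sleep 600   # stay allocated long enough to observe the timeslicing

  # test-gpu.sh (submitted with: sbatch --gres=gpu:2 test-gpu.sh)
  #!/bin/bash
  #SBATCH --job-name=gpu
  sleep 600   # hold 2 of the 4 GPUs so the 3rd job has to wait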
============================================================================

scontrol show partition asimov01

PartitionName=asimov01
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=YES QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=1 MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=asimov
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=GANG,SUSPEND
   State=UP TotalCPUs=64 TotalNodes=1 SelectTypeParameters=NONE
   JobDefaults=DefCpuPerGPU=2
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

On Fri, 13 Jan 2023 at 11:16, Kevin Broch <kbr...@rivosinc.com> wrote:

> Problem might be that OverSubscribe is not enabled? w/o it, I don't
> believe the time-slicing can be GANG scheduled.
>
> Can you do a "scontrol show partition" to verify that it is?
>
> On Thu, Jan 12, 2023 at 6:24 PM Helder Daniel <hdan...@ualg.pt> wrote:
>
>> Hi,
>>
>> I am trying to enable gang scheduling on a server with a 32-core CPU
>> (64 hardware threads) and 4 GPUs.
>>
>> However, using gang scheduling, the CPU jobs (or GPU jobs) are not
>> being preempted after the time slice, which is set to 30 seconds.
>>
>> Below is a snapshot of squeue. There are 3 jobs, each needing 32
>> cores. The first 2 jobs launched are never preempted. The 3rd job
>> starves forever (or at least until one of the other 2 ends):
>>
>>  JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
>>    313  asimov01 cpu-only  hdaniel PD   0:00      1 (Resources)
>>    311  asimov01 cpu-only  hdaniel  R   1:52      1 asimov
>>    312  asimov01 cpu-only  hdaniel  R   1:49      1 asimov
>>
>> The same happens with GPU jobs. If I launch 5 jobs, requiring one GPU
>> each, the 5th job will never run. Preemption is not happening at the
>> specified timeslice.
>>
>> I tried several combinations:
>>
>>   SchedulerType=sched/builtin and sched/backfill
>>   SelectType=select/cons_tres and select/linear
>>
>> I'll appreciate any help and suggestions.
>> The slurm.conf is below.
>>
>> Thanks
>>
>> ClusterName=asimov
>> SlurmctldHost=localhost
>> MpiDefault=none
>> ProctrackType=proctrack/linuxproc  # proctrack/cgroup
>> ReturnToService=2
>> SlurmctldPidFile=/var/run/slurmctld.pid
>> SlurmctldPort=6817
>> SlurmdPidFile=/var/run/slurmd.pid
>> SlurmdPort=6818
>> SlurmdSpoolDir=/var/lib/slurm/slurmd
>> SlurmUser=slurm
>> StateSaveLocation=/var/lib/slurm/slurmctld
>> SwitchType=switch/none
>> TaskPlugin=task/none  # task/cgroup
>> #
>> # TIMERS
>> InactiveLimit=0
>> KillWait=30
>> MinJobAge=300
>> SlurmctldTimeout=120
>> SlurmdTimeout=300
>> Waittime=0
>> #
>> # SCHEDULING
>> #FastSchedule=1  # obsolete
>> SchedulerType=sched/builtin  # backfill
>> SelectType=select/cons_tres
>> SelectTypeParameters=CR_Core  # CR_Core_Memory lets only one job run at a time
>> PreemptType=preempt/partition_prio
>> PreemptMode=SUSPEND,GANG
>> SchedulerTimeSlice=30  # in seconds, default 30
>> #
>> # LOGGING AND ACCOUNTING
>> #AccountingStoragePort=
>> AccountingStorageType=accounting_storage/none
>> #AccountingStorageEnforce=associations
>> #ClusterName=bip-cluster
>> JobAcctGatherFrequency=30
>> JobAcctGatherType=jobacct_gather/linux
>> SlurmctldDebug=info
>> SlurmctldLogFile=/var/log/slurm/slurmctld.log
>> SlurmdDebug=info
>> SlurmdLogFile=/var/log/slurm/slurmd.log
>> #
>> #
>> # COMPUTE NODES
>> #NodeName=asimov CPUs=64 RealMemory=500 State=UNKNOWN
>> #PartitionName=LocalQ Nodes=ALL Default=YES MaxTime=INFINITE State=UP
>>
>> # Partitions
>> GresTypes=gpu
>> NodeName=asimov Gres=gpu:4 Sockets=1 CoresPerSocket=32 ThreadsPerCore=2 State=UNKNOWN
>> PartitionName=asimov01 Nodes=asimov Default=YES MaxTime=INFINITE MaxNodes=1 DefCpuPerGPU=2 State=UP
>>
>> --
>> Best regards,
>> Helder Daniel
>> Universidade do Algarve
>> Faculdade de Ciências e Tecnologia
>> Departamento de Engenharia Electrónica e Informática
>> https://www.ualg.pt/pt/users/hdaniel