My guess is that this isn't possible with GANG,SUSPEND. Slurm doesn't manage
GPU memory, so suspending one job's GPU memory so that another job can use
the GPUs simply isn't possible.
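
To illustrate that guess, here is a toy Python model. It is not Slurm code, and
the Job class and schedule function are made up for this sketch. It assumes
that time-slicing lets admitted jobs take turns on the cores, so cores never
block admission as long as a job fits on the node, while GPUs stay bound to a
job even when it is suspended. It also ignores any cap on how many jobs may
share a core.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: int
    cores: int
    gpus: int = 0


def schedule(jobs, total_cores=64, total_gpus=4):
    """Split jobs into (active, pending) under the toy assumptions above.

    "active" means the job takes part in the time-slice rotation (running or
    suspended at any given moment); "pending" means it cannot start at all.
    """
    active, pending = [], []
    gpus_free = total_gpus
    for job in jobs:
        fits_node = job.cores <= total_cores and job.gpus <= total_gpus
        # Assumption: cores are shared by time-slicing, GPUs are not.
        if fits_node and job.gpus <= gpus_free:
            gpus_free -= job.gpus
            active.append(job.job_id)
        else:
            pending.append(job.job_id)
    return active, pending


# Three CPU-only jobs of 32 cores each, as in the first squeue listing.
cpu_active, cpu_pending = schedule([Job(351, 32), Job(352, 32), Job(353, 32)])
print("CPU jobs active:", cpu_active, "pending:", cpu_pending)

# Three GPU jobs of 2 GPUs each; 4 cores per job is assumed from DefCpuPerGPU=2.
gpu_active, gpu_pending = schedule([Job(354, 4, 2), Job(355, 4, 2), Job(356, 4, 2)])
print("GPU jobs active:", gpu_active, "pending:", gpu_pending)
```

Under those assumptions all three CPU jobs rotate, but the third GPU job stays
pending until 354 or 355 finishes, which matches the squeue output below.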

> Hi Kevin
> I did a "scontrol show partition".
> Oversubscribe was not enabled.
> I enabled it in slurm.conf with:
> (...)
> GresTypes=gpu
> NodeName=asimov Gres=gpu:4 Sockets=1 CoresPerSocket=32 ThreadsPerCore=2
> PartitionName=asimov01 *OverSubscribe=FORCE* Nodes=asimov Default=YES MaxTime=INFINITE MaxNodes=1 DefCpuPerGPU=2 State=UP
> but now it works only with CPU jobs. It does not preempt GPU jobs.
> Launching 3 CPU-only jobs, each requiring 32 out of 64 cores, it preempts
> after the time slice as expected:
> sbatch --cpus-per-task=32
>              JOBID PARTITION     NAME     USER ST       TIME  NODES
>                352  asimov01 cpu-only  hdaniel  R       0:58      1 asimov
>                353  asimov01 cpu-only  hdaniel  R       0:25      1 asimov
>                351  asimov01 cpu-only  hdaniel  S       0:36      1 asimov
> But launching 3 GPU jobs, each requiring 2 out of 4 GPUs, it does not
> preempt the first 2 that start running.
> It says that the 3rd job is pending on resources:
>              JOBID PARTITION     NAME     USER ST       TIME  NODES
>                356  asimov01      gpu  hdaniel PD       0:00      1 (Resources)
>                354  asimov01      gpu  hdaniel  R       3:05      1 asimov
>                355  asimov01      gpu  hdaniel  R       3:02      1 asimov
> Do I need to change anything else in the configuration to also support GPU
> gang scheduling?
> Thanks
> ============================================================================
> scontrol show partition asimov01
> PartitionName=asimov01
>    AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
>    AllocNodes=ALL Default=YES QoS=N/A
>    DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
>    MaxNodes=1 MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
>    Nodes=asimov
>    PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
>    OverTimeLimit=NONE PreemptMode=GANG,SUSPEND
>    State=UP TotalCPUs=64 TotalNodes=1 SelectTypeParameters=NONE
>    JobDefaults=DefCpuPerGPU=2
>    DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
>> The problem might be that OverSubscribe is not enabled. Without it, I don't
>> believe jobs can be time-sliced by gang scheduling.
>> Can you run "scontrol show partition" to verify that it is?
>>> Hi,
>>> I am trying to enable gang scheduling on a server with a CPU with 32
>>> cores and 4 GPUs.
>>> However, with gang scheduling, the CPU jobs (or GPU jobs) are not being
>>> preempted after the time slice, which is set to 30 seconds.
>>> Below is a snapshot of squeue. There are 3 jobs, each needing 32 cores.
>>> The first 2 jobs launched are never preempted. The 3rd job starves forever
>>> (or at least until one of the other 2 ends):
>>>              JOBID PARTITION     NAME     USER ST       TIME  NODES
>>>                313  asimov01 cpu-only  hdaniel PD       0:00      1 (Resources)
>>>                311  asimov01 cpu-only  hdaniel  R       1:52      1 asimov
>>>                312  asimov01 cpu-only  hdaniel  R       1:49      1 asimov
>>> The same happens with GPU jobs. If I launch 5 jobs, requiring one GPU
>>> each, the 5th job will never run. The preemption is not working with the
>>> specified timeslice.
>>> I tried several combinations:
>>> SchedulerType=sched/builtin and sched/backfill
>>> SelectType=select/cons_tres and select/linear
>>> I'd appreciate any help and suggestions.
>>> The slurm.conf is below.
>>> Thanks
>>> ClusterName=asimov
>>> SlurmctldHost=localhost
>>> MpiDefault=none
>>> ProctrackType=proctrack/linuxproc # proctrack/cgroup
>>> ReturnToService=2
>>> SlurmctldPidFile=/var/run/
>>> SlurmctldPort=6817
>>> SlurmdPidFile=/var/run/
>>> SlurmdPort=6818
>>> SlurmdSpoolDir=/var/lib/slurm/slurmd
>>> SlurmUser=slurm
>>> StateSaveLocation=/var/lib/slurm/slurmctld
>>> SwitchType=switch/none
>>> TaskPlugin=task/none # task/cgroup
>>> #
>>> # TIMERS
>>> InactiveLimit=0
>>> KillWait=30
>>> MinJobAge=300
>>> SlurmctldTimeout=120
>>> SlurmdTimeout=300
>>> Waittime=0
>>> #
>>> #FastSchedule=1 #obsolete
>>> SchedulerType=sched/builtin #backfill
>>> SelectType=select/cons_tres
>>> SelectTypeParameters=CR_Core    #CR_Core_Memory lets only one job run at a time
>>> PreemptType = preempt/partition_prio
>>> PreemptMode = SUSPEND,GANG
>>> SchedulerTimeSlice=30           #in seconds, default 30
>>> #
>>> #AccountingStoragePort=
>>> AccountingStorageType=accounting_storage/none
>>> #AccountingStorageEnforce=associations
>>> #ClusterName=bip-cluster
>>> JobAcctGatherFrequency=30
>>> JobAcctGatherType=jobacct_gather/linux
>>> SlurmctldDebug=info
>>> SlurmctldLogFile=/var/log/slurm/slurmctld.log
>>> SlurmdDebug=info
>>> SlurmdLogFile=/var/log/slurm/slurmd.log
>>> #
>>> #
>>> #NodeName=asimov CPUs=64 RealMemory=500 State=UNKNOWN
>>> #PartitionName=LocalQ Nodes=ALL Default=YES MaxTime=INFINITE State=UP
>>> # Partitions
>>> GresTypes=gpu
>>> NodeName=asimov Gres=gpu:4 Sockets=1 CoresPerSocket=32 ThreadsPerCore=2 State=UNKNOWN
>>> PartitionName=asimov01 Nodes=asimov Default=YES MaxTime=INFINITE MaxNodes=1 DefCpuPerGPU=2 State=UP
