Hi all,

We are trying to implement preemption on one of our partitions so that 
high-priority jobs can suspend the jobs already running there and resume 
them once the high-priority job is done. We have read through the Slurm 
documentation and done the configuration, but we cannot make it work. Other 
preempt modes such as CANCEL work fine, but SUSPEND does not: the 
higher-priority job that should suspend the others just stays pending, 
waiting for resources. We use a QOS to assign priority to the job, and the 
QOS is also configured so it can preempt certain other QOSes.
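
For reference, the two QOSes were created with sacctmgr commands along these 
lines (reconstructed from the settings shown further down, so treat this as a 
sketch rather than a verbatim transcript):

  # QOS that does the preempting
  sacctmgr add qos preempter
  sacctmgr modify qos preempter set Priority=1000 Preempt=preempted PreemptMode=gang,suspend

  # QOS that can be preempted
  sacctmgr add qos preempted
  sacctmgr modify qos preempted set Priority=0 PreemptMode=gang,suspend Flags=OverPartQOS MaxTRESPerUser=cpu=100,node=10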

This is the output of our scontrol show config:

AccountingStorageBackupHost = (null)
AccountingStorageEnforce = associations,limits,qos
AccountingStorageHost   = -Some-server-
AccountingStorageExternalHost = (null)
AccountingStorageParameters = (null)
AccountingStoragePort   = 6819
AccountingStorageTRES   = cpu,mem,energy,node,billing,fs/disk,vmem,pages
AccountingStorageType   = accounting_storage/slurmdbd
AccountingStorageUser   = N/A
AccountingStoreFlags    = job_comment,job_env,job_extra,job_script
AcctGatherEnergyType    = (null)
AcctGatherFilesystemType = (null)
AcctGatherInterconnectType = (null)
AcctGatherNodeFreq      = 0 sec
AcctGatherProfileType   = (null)
AllowSpecResourcesUsage = No
AuthAltTypes            = (null)
AuthAltParameters       = (null)
AuthInfo                = (null)
AuthType                = auth/munge
BatchStartTimeout       = 10 sec
BcastExclude            = /lib,/usr/lib,/lib64,/usr/lib64
BcastParameters         = (null)
BOOT_TIME               = 2024-03-15T13:32:05
BurstBufferType         = (null)
CliFilterPlugins        = (null)
ClusterName             = cluster
CommunicationParameters = (null)
CompleteWait            = 0 sec
CoreSpecPlugin          = (null)
CpuFreqDef              = Unknown
CpuFreqGovernors        = OnDemand,Performance,UserSpace
CredType                = cred/munge
DebugFlags              = (null)
DefMemPerCPU            = 3500
DependencyParameters    = (null)
DisableRootJobs         = Yes
EioTimeout              = 60
EnforcePartLimits       = ALL
Epilog                  = (null)
EpilogMsgTime           = 2000 usec
EpilogSlurmctld         = (null)
ExtSensorsType          = (null)
ExtSensorsFreq          = 0 sec
FairShareDampeningFactor = 1
FederationParameters    = (null)
FirstJobId              = 1
GetEnvTimeout           = 2 sec
GresTypes               = gpu
GpuFreqDef              = (null)
GroupUpdateForce        = 1
GroupUpdateTime         = 600 sec
HASH_VAL                = Match
HealthCheckInterval     = 300 sec
HealthCheckNodeState    = ANY
HealthCheckProgram      = /usr/sbin/nhc
InactiveLimit           = 0 sec
InteractiveStepOptions  = --interactive --preserve-env --pty $SHELL
JobAcctGatherFrequency  = 30
JobAcctGatherType       = jobacct_gather/linux
JobAcctGatherParams     = (null)
JobCompHost             = localhost
JobCompLoc              = (null)
JobCompParams           = (null)
JobCompPort             = 0
JobCompType             = (null)
JobCompUser             = root
JobContainerType        = (null)
JobDefaults             = (null)
JobFileAppend           = 0
JobRequeue              = 1
JobSubmitPlugins        = lua
KillOnBadExit           = 0
KillWait                = 30 sec
LaunchParameters        = (null)
Licenses                = (null)
LogTimeFormat           = iso8601_ms
MailDomain              = (null)
MailProg                = /bin/mail
MaxArraySize            = 1001
MaxBatchRequeue         = 5
MaxDBDMsgs              = 20024
MaxJobCount             = 10000
MaxJobId                = 67043328
MaxMemPerNode           = UNLIMITED
MaxNodeCount            = 6
MaxStepCount            = 40000
MaxTasksPerNode         = 512
MCSPlugin               = (null)
MCSParameters           = (null)
MessageTimeout          = 10 sec
MinJobAge               = 300 sec
MpiDefault              = pmix_v2
MpiParams               = (null)
NEXT_JOB_ID             = 3626
NodeFeaturesPlugins     = (null)
OverTimeLimit           = 0 min
PluginDir               = /usr/lib64/slurm
PlugStackConfig         = (null)
PowerParameters         = (null)
PowerPlugin             = (null)
PreemptMode             = GANG,SUSPEND
PreemptParameters       = (null)
PreemptType             = preempt/qos
PreemptExemptTime       = 00:00:00
PrEpParameters          = (null)
PrEpPlugins             = prep/script
PriorityParameters      = (null)
PrioritySiteFactorParameters = (null)
PrioritySiteFactorPlugin = (null)
PriorityDecayHalfLife   = 7-00:00:00
PriorityCalcPeriod      = 00:05:00
PriorityFavorSmall      = No
PriorityFlags           =
PriorityMaxAge          = 7-00:00:00
PriorityUsageResetPeriod = NONE
PriorityType            = priority/multifactor
PriorityWeightAge       = 0
PriorityWeightAssoc     = 0
PriorityWeightFairShare = 0
PriorityWeightJobSize   = 0
PriorityWeightPartition = 0
PriorityWeightQOS       = 5000
PriorityWeightTRES      = (null)
PrivateData             = none
ProctrackType           = proctrack/cgroup
Prolog                  = (null)
PrologEpilogTimeout     = 65534
PrologSlurmctld         = (null)
PrologFlags             = (null)
PropagatePrioProcess    = 0
PropagateResourceLimits = ALL
PropagateResourceLimitsExcept = (null)
RebootProgram           = /etc/slurm/slurmupdate.sh
ReconfigFlags           = (null)
RequeueExit             = (null)
RequeueExitHold         = (null)
ResumeFailProgram       = (null)
ResumeProgram           = (null)
ResumeRate              = 300 nodes/min
ResumeTimeout           = 60 sec
ResvEpilog              = (null)
ResvOverRun             = 0 min
ResvProlog              = (null)
ReturnToService         = 1
SchedulerParameters     = bf_max_job_user=2
SchedulerTimeSlice      = 30 sec
SchedulerType           = sched/backfill
ScronParameters         = (null)
SelectType              = select/cons_tres
SelectTypeParameters    = CR_CORE_MEMORY
SlurmUser               = slurm(888)
SlurmctldAddr           = (null)
SlurmctldDebug          = info
SlurmctldHost[0]        = -some-server-
SlurmctldLogFile        = /var/log/slurm/slurmctld.log
SlurmctldPort           = 6817
SlurmctldSyslogDebug    = (null)
SlurmctldPrimaryOffProg = (null)
SlurmctldPrimaryOnProg  = (null)
SlurmctldTimeout        = 120 sec
SlurmctldParameters     = (null)
SlurmdDebug             = info
SlurmdLogFile           = /var/log/slurm/slurmd.log
SlurmdParameters        = (null)
SlurmdPidFile           = /var/run/slurmd.pid
SlurmdPort              = 6818
SlurmdSpoolDir          = /var/spool/slurm
SlurmdSyslogDebug       = (null)
SlurmdTimeout           = 300 sec
SlurmdUser              = root(0)
SlurmSchedLogFile       = (null)
SlurmSchedLogLevel      = 0
SlurmctldPidFile        = /var/run/slurmctld.pid
SLURM_CONF              = /etc/slurm/slurm.conf
SLURM_VERSION           = 23.11.1
SrunEpilog              = (null)
SrunPortRange           = 0-0
SrunProlog              = (null)
StateSaveLocation       = /var/spool/slurmctld
SuspendExcNodes         = (null)
SuspendExcParts         = (null)
SuspendExcStates        = (null)
SuspendProgram          = (null)
SuspendRate             = 60 nodes/min
SuspendTime             = INFINITE
SuspendTimeout          = 30 sec
SwitchParameters        = (null)
SwitchType              = (null)
TaskEpilog              = (null)
TaskPlugin              = task/cgroup,task/affinity
TaskPluginParam         = none
TaskProlog              = (null)
TCPTimeout              = 2 sec
TmpFS                   = /tmp
TopologyParam           = (null)
TopologyPlugin          = topology/default
TrackWCKey              = No
TreeWidth               = 16
UsePam                  = No
UnkillableStepProgram   = (null)
UnkillableStepTimeout   = 60 sec
VSizeFactor             = 0 percent
WaitTime                = 0 sec
X11Parameters           = (null)

Cgroup Support Configuration:
AllowedRAMSpace         = 100.0%
AllowedSwapSpace        = 0.0%
CgroupMountpoint        = /sys/fs/cgroup
CgroupPlugin            = autodetect
ConstrainCores          = yes
ConstrainDevices        = yes
ConstrainRAMSpace       = yes
ConstrainSwapSpace      = no
EnableControllers       = no
IgnoreSystemd           = no
IgnoreSystemdOnFailure  = no
MaxRAMPercent           = 100.0%
MaxSwapPercent          = 100.0%
MemorySwappiness        = (null)
MinRAMSpace             = 30 MB

MPI Plugins Configuration:
PMIxCliTmpDirBase       = (null)
PMIxCollFence           = (null)
PMIxDebug               = 0
PMIxDirectConn          = yes
PMIxDirectConnEarly     = no
PMIxDirectConnUCX       = no
PMIxDirectSameArch      = no
PMIxEnv                 = (null)
PMIxFenceBarrier        = no
PMIxNetDevicesUCX       = (null)
PMIxTimeout             = 300
PMIxTlsUCX              = (null)
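
For convenience, the preemption-related settings in the dump above amount to 
this slurm.conf fragment (an excerpt, not the full file):

  PreemptType=preempt/qos           # preemption decisions are driven by QOS
  PreemptMode=GANG,SUSPEND          # preempted jobs are suspended and gang-scheduled
  PriorityType=priority/multifactor
  PriorityWeightQOS=5000            # QOS is the only non-zero priority weight
  SelectType=select/cons_tres
  SelectTypeParameters=CR_CORE_MEMORY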

The partition is configured like this:

PartitionName=test-partition
   AllowGroups=sysadmin,users AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=NONE DisableRootJobs=YES ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED
   Nodes=vserv-[275-277]
   PriorityJobFactor=1 PriorityTier=100 RootOnly=NO ReqResv=NO OverSubscribe=FORCE:1
   OverTimeLimit=NONE PreemptMode=GANG,SUSPEND
   State=UP TotalCPUs=32 TotalNodes=4 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
   TRES=cpu=32,mem=304933M,node=4,billing=32
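
(The slurm.conf definition behind this is presumably a single line along these 
lines; the output above is scontrol's expanded view, so the actual line may 
differ slightly:)

  PartitionName=test-partition Nodes=vserv-[275-277] PriorityTier=100 OverSubscribe=FORCE:1 PreemptMode=GANG,SUSPEND State=UP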

Our QOSes look like this:

| Name      | Priority | GraceTime | Preempt   | PreemptMode  | Flags       | UsageFactor | MaxTRESPU       | MaxJobsPU | MaxSubmitPU |
|-----------|----------|-----------|-----------|--------------|-------------|-------------|-----------------|-----------|-------------|
| normal    |       50 | 00:00:00  |           | cluster      |             | 1.000000    |                 |        20 |          50 |
| preempter |     1000 | 00:00:00  | preempted | gang,suspend |             | 1.000000    |                 |           |             |
| preempted |        0 | 00:00:00  |           | gang,suspend | OverPartQOS | 1.000000    | cpu=100,node=10 |           |             |
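
The test jobs were submitted roughly like this (job.sh stands in for our 
actual batch script):

  # fill the partition with preemptable work
  sbatch -p test-partition --qos=preempted job.sh
  # this job should suspend the running ones, but it stays pending instead
  sbatch -p test-partition --qos=preempter job.sh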

I can provide more configs if needed. Do you see anything strange, or is 
there any property we still need to set?
This is the state of the queue:

| JOBID | QOS       | ST | TIME | NODELIST(REASON) | PARTITION      | PRIORITY |
|-------|-----------|----|------|------------------|----------------|----------|
| 3629  | preempted | PD | 0:00 | (Resources)      | test-partition |        1 |
| 3627  | preempted | R  | 0:20 | vserv-276        | test-partition |        1 |
| 3628  | preempted | R  | 0:20 | vserv-277        | test-partition |        1 |
| 3626  | preempted | R  | 0:27 | vserv-275        | test-partition |        1 |
| 3630  | preempter | PD | 0:00 | (Resources)      | test-partition |     5000 |
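
(The listing corresponds to a custom squeue format, roughly 
squeue -p test-partition -o "%i %q %t %M %R %P %Q", where %q is the QOS and 
%Q the job priority.)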

Any advice is welcome.


Regards,

Nischey Verma
