CPUs are released on suspend, but memory is not. Compare the allocated memory
(AllocMem) on a node before and after suspending a job running there:

sinfo -N -n yourNode --Format=weight:8,nodelist:15,cpusstate:12,memory:8,allocmem:8
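A minimal before/after check along those lines might look like this (job ID 3626 and node vserv-275 are taken from the queue output further down; substitute your own):

```shell
# Allocation on the node while the job is running
sinfo -N -n vserv-275 --Format=weight:8,nodelist:15,cpusstate:12,memory:8,allocmem:8

# Suspend the job (needs SlurmUser/root or operator privileges)
scontrol suspend 3626

# Re-run the same query: CPUsState shows the CPUs freed,
# but AllocMem should remain unchanged for the suspended job
sinfo -N -n vserv-275 --Format=weight:8,nodelist:15,cpusstate:12,memory:8,allocmem:8
```

If the pending preemptor job needs that memory, it will stay pending even though the lower-priority jobs were suspended.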

From: Verma, Nischey (HPC ENG,RAL,LSCI) via slurm-users 
<slurm-users@lists.schedmd.com>
Sent: Friday, March 15, 2024 11:06 AM
To: slurm-users@lists.schedmd.com
Cc: Taneja, Sonia (DLSLtd,RAL,LSCI) <sonia.tan...@diamond.ac.uk>
Subject: [slurm-users] Slurm suspend preemption not working

Hi All,
We are trying to implement preemption on one of our partitions so that we can
run priority jobs there: jobs already running on the partition should be
suspended, then resumed once the priority job is done. We have read through the
Slurm documentation and done the configuration, but we cannot make it work.
Other preemption modes, such as CANCEL, work fine, but SUSPEND does not: the
higher-priority jobs that should suspend the others stay pending, waiting for
resources. We are using a QOS to assign job priority, and that QOS is also
configured to preempt certain other QOSs. This is the output of our
scontrol show config:

AccountingStorageBackupHost = (null)

AccountingStorageEnforce = associations,limits,qos

AccountingStorageHost   = -Some-server-

AccountingStorageExternalHost = (null)

AccountingStorageParameters = (null)

AccountingStoragePort   = 6819

AccountingStorageTRES   = cpu,mem,energy,node,billing,fs/disk,vmem,pages

AccountingStorageType   = accounting_storage/slurmdbd

AccountingStorageUser   = N/A

AccountingStoreFlags    = job_comment,job_env,job_extra,job_script

AcctGatherEnergyType    = (null)

AcctGatherFilesystemType = (null)

AcctGatherInterconnectType = (null)

AcctGatherNodeFreq      = 0 sec

AcctGatherProfileType   = (null)

AllowSpecResourcesUsage = No

AuthAltTypes            = (null)

AuthAltParameters       = (null)

AuthInfo                = (null)

AuthType                = auth/munge

BatchStartTimeout       = 10 sec

BcastExclude            = /lib,/usr/lib,/lib64,/usr/lib64

BcastParameters         = (null)

BOOT_TIME               = 2024-03-15T13:32:05

BurstBufferType         = (null)

CliFilterPlugins        = (null)

ClusterName             = cluster

CommunicationParameters = (null)

CompleteWait            = 0 sec

CoreSpecPlugin          = (null)

CpuFreqDef              = Unknown

CpuFreqGovernors        = OnDemand,Performance,UserSpace

CredType                = cred/munge

DebugFlags              = (null)

DefMemPerCPU            = 3500

DependencyParameters    = (null)

DisableRootJobs         = Yes

EioTimeout              = 60

EnforcePartLimits       = ALL

Epilog                  = (null)

EpilogMsgTime           = 2000 usec

EpilogSlurmctld         = (null)

ExtSensorsType          = (null)

ExtSensorsFreq          = 0 sec

FairShareDampeningFactor = 1

FederationParameters    = (null)

FirstJobId              = 1

GetEnvTimeout           = 2 sec

GresTypes               = gpu

GpuFreqDef              = (null)

GroupUpdateForce        = 1

GroupUpdateTime         = 600 sec

HASH_VAL                = Match

HealthCheckInterval     = 300 sec

HealthCheckNodeState    = ANY

HealthCheckProgram      = /usr/sbin/nhc

InactiveLimit           = 0 sec

InteractiveStepOptions  = --interactive --preserve-env --pty $SHELL

JobAcctGatherFrequency  = 30

JobAcctGatherType       = jobacct_gather/linux

JobAcctGatherParams     = (null)

JobCompHost             = localhost

JobCompLoc              = (null)

JobCompParams           = (null)

JobCompPort             = 0

JobCompType             = (null)

JobCompUser             = root

JobContainerType        = (null)

JobDefaults             = (null)

JobFileAppend           = 0

JobRequeue              = 1

JobSubmitPlugins        = lua

KillOnBadExit           = 0

KillWait                = 30 sec

LaunchParameters        = (null)

Licenses                = (null)

LogTimeFormat           = iso8601_ms

MailDomain              = (null)

MailProg                = /bin/mail

MaxArraySize            = 1001

MaxBatchRequeue         = 5

MaxDBDMsgs              = 20024

MaxJobCount             = 10000

MaxJobId                = 67043328

MaxMemPerNode           = UNLIMITED

MaxNodeCount            = 6

MaxStepCount            = 40000

MaxTasksPerNode         = 512

MCSPlugin               = (null)

MCSParameters           = (null)

MessageTimeout          = 10 sec

MinJobAge               = 300 sec

MpiDefault              = pmix_v2

MpiParams               = (null)

NEXT_JOB_ID             = 3626

NodeFeaturesPlugins     = (null)

OverTimeLimit           = 0 min

PluginDir               = /usr/lib64/slurm

PlugStackConfig         = (null)

PowerParameters         = (null)

PowerPlugin             = (null)

PreemptMode             = GANG,SUSPEND

PreemptParameters       = (null)

PreemptType             = preempt/qos

PreemptExemptTime       = 00:00:00

PrEpParameters          = (null)

PrEpPlugins             = prep/script

PriorityParameters      = (null)

PrioritySiteFactorParameters = (null)

PrioritySiteFactorPlugin = (null)

PriorityDecayHalfLife   = 7-00:00:00

PriorityCalcPeriod      = 00:05:00

PriorityFavorSmall      = No

PriorityFlags           =

PriorityMaxAge          = 7-00:00:00

PriorityUsageResetPeriod = NONE

PriorityType            = priority/multifactor

PriorityWeightAge       = 0

PriorityWeightAssoc     = 0

PriorityWeightFairShare = 0

PriorityWeightJobSize   = 0

PriorityWeightPartition = 0

PriorityWeightQOS       = 5000

PriorityWeightTRES      = (null)

PrivateData             = none

ProctrackType           = proctrack/cgroup

Prolog                  = (null)

PrologEpilogTimeout     = 65534

PrologSlurmctld         = (null)

PrologFlags             = (null)

PropagatePrioProcess    = 0

PropagateResourceLimits = ALL

PropagateResourceLimitsExcept = (null)

RebootProgram           = /etc/slurm/slurmupdate.sh

ReconfigFlags           = (null)

RequeueExit             = (null)

RequeueExitHold         = (null)

ResumeFailProgram       = (null)

ResumeProgram           = (null)

ResumeRate              = 300 nodes/min

ResumeTimeout           = 60 sec

ResvEpilog              = (null)

ResvOverRun             = 0 min

ResvProlog              = (null)

ReturnToService         = 1

SchedulerParameters     = bf_max_job_user=2

SchedulerTimeSlice      = 30 sec

SchedulerType           = sched/backfill

ScronParameters         = (null)

SelectType              = select/cons_tres

SelectTypeParameters    = CR_CORE_MEMORY

SlurmUser               = slurm(888)

SlurmctldAddr           = (null)

SlurmctldDebug          = info

SlurmctldHost[0]        = -some-server-

SlurmctldLogFile        = /var/log/slurm/slurmctld.log

SlurmctldPort           = 6817

SlurmctldSyslogDebug    = (null)

SlurmctldPrimaryOffProg = (null)

SlurmctldPrimaryOnProg  = (null)

SlurmctldTimeout        = 120 sec

SlurmctldParameters     = (null)

SlurmdDebug             = info

SlurmdLogFile           = /var/log/slurm/slurmd.log

SlurmdParameters        = (null)

SlurmdPidFile           = /var/run/slurmd.pid

SlurmdPort              = 6818

SlurmdSpoolDir          = /var/spool/slurm

SlurmdSyslogDebug       = (null)

SlurmdTimeout           = 300 sec

SlurmdUser              = root(0)

SlurmSchedLogFile       = (null)

SlurmSchedLogLevel      = 0

SlurmctldPidFile        = /var/run/slurmctld.pid

SLURM_CONF              = /etc/slurm/slurm.conf

SLURM_VERSION           = 23.11.1

SrunEpilog              = (null)

SrunPortRange           = 0-0

SrunProlog              = (null)

StateSaveLocation       = /var/spool/slurmctld

SuspendExcNodes         = (null)

SuspendExcParts         = (null)

SuspendExcStates        = (null)

SuspendProgram          = (null)

SuspendRate             = 60 nodes/min

SuspendTime             = INFINITE

SuspendTimeout          = 30 sec

SwitchParameters        = (null)

SwitchType              = (null)

TaskEpilog              = (null)

TaskPlugin              = task/cgroup,task/affinity

TaskPluginParam         = none

TaskProlog              = (null)

TCPTimeout              = 2 sec

TmpFS                   = /tmp

TopologyParam           = (null)

TopologyPlugin          = topology/default

TrackWCKey              = No

TreeWidth               = 16

UsePam                  = No

UnkillableStepProgram   = (null)

UnkillableStepTimeout   = 60 sec

VSizeFactor             = 0 percent

WaitTime                = 0 sec

X11Parameters           = (null)



Cgroup Support Configuration:

AllowedRAMSpace         = 100.0%

AllowedSwapSpace        = 0.0%

CgroupMountpoint        = /sys/fs/cgroup

CgroupPlugin            = autodetect

ConstrainCores          = yes

ConstrainDevices        = yes

ConstrainRAMSpace       = yes

ConstrainSwapSpace      = no

EnableControllers       = no

IgnoreSystemd           = no

IgnoreSystemdOnFailure  = no

MaxRAMPercent           = 100.0%

MaxSwapPercent          = 100.0%

MemorySwappiness        = (null)

MinRAMSpace             = 30 MB



MPI Plugins Configuration:

PMIxCliTmpDirBase       = (null)

PMIxCollFence           = (null)

PMIxDebug               = 0

PMIxDirectConn          = yes

PMIxDirectConnEarly     = no

PMIxDirectConnUCX       = no

PMIxDirectSameArch      = no

PMIxEnv                 = (null)

PMIxFenceBarrier        = no

PMIxNetDevicesUCX       = (null)

PMIxTimeout             = 300

PMIxTlsUCX              = (null)
The partition is configured like this:

PartitionName=test-partition

   AllowGroups=sysadmin,users AllowAccounts=ALL AllowQos=ALL

   AllocNodes=ALL Default=NO QoS=N/A

   DefaultTime=NONE DisableRootJobs=YES ExclusiveUser=NO GraceTime=0 Hidden=NO

   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO 
MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED

   Nodes=vserv-[275-277]

   PriorityJobFactor=1 PriorityTier=100 RootOnly=NO ReqResv=NO 
OverSubscribe=FORCE:1

   OverTimeLimit=NONE PreemptMode=GANG,SUSPEND

   State=UP TotalCPUs=32 TotalNodes=4 SelectTypeParameters=NONE

   JobDefaults=(null)

   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

   TRES=cpu=32,mem=304933M,node=4,billing=32
Our QOS looks like this:

| Name      | Priority | GraceTime | Preempt   | PreemptMode | Flags       | UsageFactor | MaxTRESPU       | MaxJobsPU | MaxSubmitPU |
|-----------|----------|-----------|-----------|-------------|-------------|-------------|-----------------|-----------|-------------|
| normal    | 50       | 00:00:00  |           | cluster     |             | 1.000000    |                 | 20        | 50          |
| preempter | 1000     | 00:00:00  | preempted | gang,suspe+ |             | 1.000000    |                 |           |             |
| preempted | 0        | 00:00:00  |           | gang,suspe+ | OverPartQOS | 1.000000    | cpu=100,node=10 |           |             |
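For reference, QOS settings like those in the table above are normally managed with sacctmgr; a sketch of the commands that would produce these rows (QOS names taken from the table, run with Slurm admin rights):

```shell
# Let the "preempter" QOS preempt jobs running under the "preempted" QOS
sacctmgr modify qos preempter set Preempt=preempted Priority=1000

# Jobs under "preempted" are gang-suspended when preempted,
# and limited to cpu=100,node=10 per user
sacctmgr modify qos preempted set PreemptMode=suspend Priority=0 \
    MaxTRESPerUser=cpu=100,node=10
```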
I can provide more config if needed. Do you see anything strange, or any
property that still needs to be set?
This is the state of the queue:

| JOBID | QOS       | ST | TIME | NODELIST(REASON) | PARTITION      | PRIORITY |
|-------|-----------|----|------|------------------|----------------|----------|
| 3629  | preempted | PD | 0:00 | (Resources)      | test-partition | 1        |
| 3627  | preempted | R  | 0:20 | vserv-276        | test-partition | 1        |
| 3628  | preempted | R  | 0:20 | vserv-277        | test-partition | 1        |
| 3626  | preempted | R  | 0:27 | vserv-275        | test-partition | 1        |
| 3630  | preempter | PD | 0:00 | (Resources)      | test-partition | 5000     |
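As a quick diagnostic sketch, you could ask the controller why the preemptor stays pending (job ID 3630 from the queue above):

```shell
# Show the pending reason and computed priority for the preemptor job
scontrol show job 3630 | grep -E 'Reason|Priority'

# Temporarily raise scheduler logging to see preemption decisions
# in slurmctld.log (revert with -Backfill afterwards)
scontrol setdebugflags +Backfill
```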
Any advice is welcome.


Regards,

Nischey Verma


-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
