Hi,
We maintain a cluster of about 250 nodes running Slurm version 21.08.6.
"scontrol show config" attached in the paste below.
Here is what we observed about the issue:
- The affected job/script doesn't start at all and terminates immediately
(we suspect because the initial cgroup setup fails).
- It happens about once a day.
- It can put several nodes in DRAIN state at the same time. When that
happens, tracing back the related job IDs always leads to a single user,
though it is not the same user each time the issue occurs.
- It happens for both array and regular jobs.
- Looking at the worker node's log, we don't find any obvious correlation
with other jobs that start or complete at the same time.
- We also can't find an obvious correlation with the CPU load of the nodes
or of the controller.
However, we have been able to capture the error with slurmd in debug mode.
Please see the following log paste:
https://postit.hadoly.fr/?ed77c43716cefbc8#3a4Ukb9aFut93gDYZW6JpJJdgYzCCBYizUeAgh6G5rfT
I have also attached it to this post as a text file (if the list allows it)
for future reference.
The DRAIN reason set by Slurm is "batch job complete failure".
The cgroup error is:
error: common_cgroup_instantiate: unable to create cgroup
'/sys/fs/cgroup/cpuset/slurm/uid_43197/job_4302857' : No such file or directory
The log entry "sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4020 status:0"
reports either error 4020 or 4014
From the user's perspective, the job is shown as COMPLETED (which sounds
counterintuitive).
Our current reading of the logs makes us think that some directory entries
are randomly missing at the moment the cgroup setup happens.
Do you think this could be a bug in Slurm (maybe a race condition)? If so,
we could go ahead and file a bug report. A toy illustration of the kind of
race we have in mind is sketched just below.
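To make that suspicion concrete, here is a minimal, self-contained sketch
(plain Python, placeholder paths under /tmp and made-up uid/job names; it is
not Slurm's actual code path): one thread keeps creating uid_*/job_* style
sub-directories while another thread removes the per-user parent, which can
produce exactly the same ENOENT as common_cgroup_instantiate.

# Toy reproduction of the suspected race, NOT Slurm code: paths and names
# below are placeholders, everything happens under /tmp.
import os
import shutil
import threading

BASE = "/tmp/cgroup_race_demo"            # stand-in for .../cpuset/slurm
PARENT = os.path.join(BASE, "uid_43197")  # stand-in for the per-user cgroup

def creator():
    for job in range(100000):
        os.makedirs(PARENT, exist_ok=True)           # parent "should" exist now
        try:
            os.mkdir(os.path.join(PARENT, f"job_{job}"))
        except FileNotFoundError as err:
            # Same errno as the slurmd error: the parent vanished between
            # its creation above and this mkdir.
            print(f"job_{job}: {err}")

def remover():
    for _ in range(100000):
        shutil.rmtree(PARENT, ignore_errors=True)    # concurrent cleanup

threads = [threading.Thread(target=creator), threading.Thread(target=remover)]
for t in threads:
    t.start()
for t in threads:
    t.join()
shutil.rmtree(BASE, ignore_errors=True)

Whether and how often this triggers is of course timing-dependent, but when
the mkdir does lose the race, the exception carries the same "No such file or
directory" errno that we see in the slurmd log.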
It looks a bit like https://bugs.schedmd.com/show_bug.cgi?id=13136, with the
difference that we are not using multiple-slurmd, and the cgroup error in our
case is about '/sys/fs/cgroup/cpuset', not '/sys/fs/cgroup/memory'.
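In the meantime, to try to catch the directory disappearing in real time, we
are considering running a small watcher along the lines of the sketch below
on one of the affected nodes (just polling with os.listdir; the mount point
and the 50 ms interval are assumptions on our side), so its timestamps can be
lined up with the slurmd log:

# Hypothetical watcher sketch: poll the Slurm cpuset hierarchy and log when
# per-user cgroup directories appear or disappear. Read-only.
import os
import time
from datetime import datetime

SLURM_CPUSET = "/sys/fs/cgroup/cpuset/slurm"   # assumed mount point on our nodes

def snapshot():
    try:
        return {d for d in os.listdir(SLURM_CPUSET) if d.startswith("uid_")}
    except FileNotFoundError:
        return set()

previous = snapshot()
while True:
    current = snapshot()
    stamp = datetime.now().isoformat(timespec="milliseconds")
    for name in sorted(previous - current):
        print(f"{stamp} removed {name}", flush=True)
    for name in sorted(current - previous):
        print(f"{stamp} created {name}", flush=True)
    previous = current
    time.sleep(0.05)   # 50 ms polling; coarse, but enough to bracket the event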
Thank you for reading; any input is welcome!
Martin
The following log snippet shows a job that will put the node in DRAIN state
with reason "batch job complete failure"
(...)
[2022-04-05T02:59:57.980] [4302857.batch] debug3: Couldn't find sym
'slurm_spank_slurmd_exit' in the plugin
[2022-04-05T02:59:57.980] [4302857.batch] debug: spank:
/etc/slurm/plugstack.conf:35: Loaded plugin use-env.so
[2022-04-05T02:59:57.980] [4302857.batch] debug: SPANK: appending plugin
option "use-env"
[2022-04-05T02:59:57.980] [4302857.batch] debug2: spank: private-tmpdir.so:
init = 0
[2022-04-05T02:59:57.980] [4302857.batch] debug2: spank: use-env.so: init = 0
[2022-04-05T02:59:57.981] [4302857.batch] debug: private-tmpdir: mounting:
/scratch/slurm.4302857.0/tmp /tmp
[2022-04-05T02:59:57.981] [4302857.batch] debug2: spank: private-tmpdir.so:
init_post_opt = 0
[2022-04-05T02:59:57.981] [4302857.batch] debug2: After call to spank_init()
[2022-04-05T02:59:57.981] [4302857.batch] debug3: cgroup/v1:
common_cgroup_set_param: common_cgroup_set_param: parameter
'cgroup.clone_children' set to '0' for '/sys/fs/cgroup/cpuset/slurm'
[2022-04-05T02:59:57.983] [4302857.batch] error: common_cgroup_instantiate:
unable to create cgroup '/sys/fs/cgroup/cpuset/slurm/uid_43197/job_4302857' :
No such file or directory
[2022-04-05T02:59:57.983] [4302857.batch] error: _cpuset_create: unable to
instantiate job 4302857 cgroup
[2022-04-05T02:59:58.008] [4302857.batch] debug3: cgroup/v1:
common_cgroup_set_param: common_cgroup_set_param: parameter
'memory.use_hierarchy' set to '1' for '/sys/fs/cgroup/memory/slurm/uid_43197'
[2022-04-05T02:59:58.008] [4302857.batch] debug3: cgroup/v1:
common_cgroup_set_param: common_cgroup_set_param: parameter
'memory.use_hierarchy' set to '1' for
'/sys/fs/cgroup/memory/slurm/uid_43197/job_4302857'
[2022-04-05T02:59:58.008] [4302857.batch] debug3: cgroup/v1:
common_cgroup_set_param: common_cgroup_set_param: parameter
'memory.use_hierarchy' set to '1' for
'/sys/fs/cgroup/memory/slurm/uid_43197/job_4302857/step_batch'
[2022-04-05T02:59:58.008] [4302857.batch] task/cgroup: _memcg_initialize: job:
alloc=3072MB mem.limit=3072MB memsw.limit=3072MB
[2022-04-05T02:59:58.008] [4302857.batch] debug3: cgroup/v1:
common_cgroup_set_uint64_param: common_cgroup_set_uint64_param: parameter
'memory.limit_in_bytes' set to '3221225472' for
'/sys/fs/cgroup/memory/slurm/uid_43197/job_4302857'
[2022-04-05T02:59:58.008] [4302857.batch] debug3: cgroup/v1:
common_cgroup_set_uint64_param: common_cgroup_set_uint64_param: parameter
'memory.soft_limit_in_bytes' set to '3221225472' for
'/sys/fs/cgroup/memory/slurm/uid_43197/job_4302857'
[2022-04-05T02:59:58.008] [4302857.batch] debug3: cgroup/v1:
common_cgroup_set_uint64_param: common_cgroup_set_uint64_param: parameter
'memory.memsw.limit_in_bytes' set to '3221225472' for
'/sys/fs/cgroup/memory/slurm/uid_43197/job_4302857'
[2022-04-05T02:59:58.008] [4302857.batch] task/cgroup: _memcg_initialize: step:
alloc=3072MB mem.limit=3072MB memsw.limit=3072MB
[2022-04-05T02:59:58.008] [4302857.batch] debug3: cgroup/v1:
common_cgroup_set_uint64_param: common_cgroup_set_uint64_param: parameter
'memory.limit_in_bytes' set to '3221225472' for
'/sys/fs/cgroup/memory/slurm/uid_43197/job_4302857/step_batch'
[2022-04-05T02:59:58.008] [4302857.batch] debug3: cgroup/v1:
common_cgroup_set_uint64_param: common_cgroup_set_uint64_param: parameter
'memory.soft_limit_in_bytes' set to '3221225472' for
'/sys/fs/cgroup/memory/slurm/uid_43197/job_4302857/step_batch'
[2022-04-05T02:59:58.008] [4302857.batch] debug3: cgroup/v1:
common_cgroup_set_uint64_param: common_cgroup_set_uint64_param: parameter
'memory.memsw.limit_in_bytes' set to '3221225472' for
'/sys/fs/cgroup/memory/slurm/uid_43197/job_4302857/step_batch'
[2022-04-05T02:59:58.009] [4302857.batch] debug: cgroup/v1:
_oom_event_monitor: started.
[2022-04-05T02:59:58.009] [4302857.batch] debug: task_g_pre_setuid:
task/cgroup: Unspecified error
[2022-04-05T02:59:58.009] [4302857.batch] error: Failed to invoke task plugins:
one of task_p_pre_setuid functions returned error
[2022-04-05T02:59:58.009] [4302857.batch] debug: _fork_all_tasks failed
[2022-04-05T02:59:58.009] [4302857.batch] debug2: step_terminate_monitor will
run for 120 secs
[2022-04-05T02:59:58.009] [4302857.batch] debug3: cgroup/v1:
common_cgroup_set_param: common_cgroup_set_param: parameter 'freezer.state' set
to 'THAWED' for '/sys/fs/cgroup/freezer/slurm/uid_43197/job_4302857/step_batch'
[2022-04-05T02:59:58.009] [4302857.batch] debug: signaling condition
[2022-04-05T02:59:58.009] [4302857.batch] debug2: step_terminate_monitor is
stopping
[2022-04-05T02:59:58.009] [4302857.batch] debug2: _monitor exit code: 0
[2022-04-05T02:59:58.021] [4302857.batch] debug3: cgroup/v1:
_oom_event_monitor: res: 1
[2022-04-05T02:59:58.021] [4302857.batch] debug: cgroup/v1:
_oom_event_monitor: oom-kill event count: 1
[2022-04-05T02:59:58.040] [4302857.batch] error: called without a previous
init. This shouldn't happen!
[2022-04-05T02:59:58.040] [4302857.batch] debug: jobacct_gather/cgroup: fini:
Job accounting gather cgroup plugin unloaded
[2022-04-05T02:59:58.040] [4302857.batch] error: called without a previous
init. This shouldn't happen!
[2022-04-05T02:59:58.041] [4302857.batch] error: called without a previous
init. This shouldn't happen!
[2022-04-05T02:59:58.041] [4302857.batch] debug: task/cgroup: fini: Tasks
containment cgroup plugin unloaded
[2022-04-05T02:59:58.041] [4302857.batch] debug2: Before call to spank_fini()
[2022-04-05T02:59:58.041] [4302857.batch] debug2: spank: private-tmpdir.so:
exit = 0
[2022-04-05T02:59:58.041] [4302857.batch] debug2: spank: use-env.so: exit = 0
[2022-04-05T02:59:58.041] [4302857.batch] debug2: After call to spank_fini()
[2022-04-05T02:59:58.041] [4302857.batch] error: job_manager: exiting
abnormally: Slurmd could not execve job
[2022-04-05T02:59:58.041] [4302857.batch] job 4302857 completed with slurm_rc =
4020, job_rc = 0
[2022-04-05T02:59:58.041] [4302857.batch] sending
REQUEST_COMPLETE_BATCH_SCRIPT, error:4020 status:0
[2022-04-05T02:59:59.405] [4302857.batch] debug3: Called _msg_socket_readable
[2022-04-05T02:59:59.405] [4302857.batch] debug2: false, shutdown
[2022-04-05T02:59:59.406] [4302857.batch] debug: Message thread exited
[2022-04-05T02:59:59.406] [4302857.batch] done with job
(...)
For reference, the following log snippet shows a job that starts without any
issue.
(...)
[2022-04-05T02:43:28.220] [4304882.batch] debug3: Couldn't find sym
'slurm_spank_slurmd_exit' in the plugin
[2022-04-05T02:43:28.220] [4304882.batch] debug: spank:
/etc/slurm/plugstack.conf:35: Loaded plugin use-env.so
[2022-04-05T02:43:28.220] [4304882.batch] debug: SPANK: appending plugin
option "use-env"
[2022-04-05T02:43:28.220] [4304882.batch] debug2: spank: private-tmpdir.so:
init = 0
[2022-04-05T02:43:28.222] [4304882.batch] debug2: spank: use-env.so: init = 0
[2022-04-05T02:43:28.224] [4304882.batch] debug: private-tmpdir: mounting:
/scratch/slurm.4304882.0/tmp /tmp
[2022-04-05T02:43:28.224] [4304882.batch] debug2: spank: private-tmpdir.so:
init_post_opt = 0
[2022-04-05T02:43:28.224] [4304882.batch] debug2: After call to spank_init()
[2022-04-05T02:43:28.224] [4304882.batch] debug3: cgroup/v1:
common_cgroup_set_param: common_cgroup_set_param: parameter
'cgroup.clone_children' set to '0' for '/sys/fs/cgroup/cpuset/slurm'
[2022-04-05T02:43:28.224] [4304882.batch] debug: task/cgroup:
task_cgroup_cpuset_create: job abstract cores are '20-21'
[2022-04-05T02:43:28.224] [4304882.batch] debug: task/cgroup:
task_cgroup_cpuset_create: step abstract cores are '20-21'
[2022-04-05T02:43:28.224] [4304882.batch] debug: task/cgroup:
task_cgroup_cpuset_create: job physical CPUs are '10,42'
[2022-04-05T02:43:28.224] [4304882.batch] debug: task/cgroup:
task_cgroup_cpuset_create: step physical CPUs are '10,42'
[2022-04-05T02:43:28.224] [4304882.batch] debug3: cgroup/v1:
common_cgroup_set_param: common_cgroup_set_param: parameter 'cpuset.cpus' set
to '10,42,0-63' for '/sys/fs/cgroup/cpuset/slurm/uid_930'
[2022-04-05T02:43:28.224] [4304882.batch] debug3: cgroup/v1:
common_cgroup_set_param: common_cgroup_set_param: parameter 'cpuset.mems' set
to '0-2,4-6' for '/sys/fs/cgroup/cpuset/slurm/uid_930'
[2022-04-05T02:43:28.226] [4304882.batch] debug3: cgroup/v1:
common_cgroup_set_param: common_cgroup_set_param: parameter 'cpuset.cpus' set
to '10,42' for '/sys/fs/cgroup/cpuset/slurm/uid_930/job_4304882'
[2022-04-05T02:43:28.226] [4304882.batch] debug3: cgroup/v1:
common_cgroup_set_param: common_cgroup_set_param: parameter 'cpuset.mems' set
to '0-2,4-6' for '/sys/fs/cgroup/cpuset/slurm/uid_930/job_4304882'
[2022-04-05T02:43:28.227] [4304882.batch] debug3: cgroup/v1:
common_cgroup_set_param: common_cgroup_set_param: parameter 'cpuset.cpus' set
to '10,42' for '/sys/fs/cgroup/cpuset/slurm/uid_930/job_4304882/step_batch'
[2022-04-05T02:43:28.227] [4304882.batch] debug3: cgroup/v1:
common_cgroup_set_param: common_cgroup_set_param: parameter 'cpuset.mems' set
to '0-2,4-6' for '/sys/fs/cgroup/cpuset/slurm/uid_930/job_4304882/step_batch'
[2022-04-05T02:43:28.241] [4304882.batch] debug3: cgroup/v1:
common_cgroup_set_param: common_cgroup_set_param: parameter
'memory.use_hierarchy' set to '1' for '/sys/fs/cgroup/memory/slurm/uid_930'
[2022-04-05T02:43:28.241] [4304882.batch] debug3: cgroup/v1:
common_cgroup_set_param: common_cgroup_set_param: parameter
'memory.use_hierarchy' set to '1' for
'/sys/fs/cgroup/memory/slurm/uid_930/job_4304882'
[2022-04-05T02:43:28.241] [4304882.batch] debug3: cgroup/v1:
common_cgroup_set_param: common_cgroup_set_param: parameter
'memory.use_hierarchy' set to '1' for
'/sys/fs/cgroup/memory/slurm/uid_930/job_4304882/step_batch'
[2022-04-05T02:43:28.241] [4304882.batch] task/cgroup: _memcg_initialize: job:
alloc=6144MB mem.limit=6144MB memsw.limit=6144MB
[2022-04-05T02:43:28.241] [4304882.batch] debug3: cgroup/v1:
common_cgroup_set_uint64_param: common_cgroup_set_uint64_param: parameter
'memory.limit_in_bytes' set to '6442450944' for
'/sys/fs/cgroup/memory/slurm/uid_930/job_4304882'
[2022-04-05T02:43:28.241] [4304882.batch] debug3: cgroup/v1:
common_cgroup_set_uint64_param: common_cgroup_set_uint64_param: parameter
'memory.soft_limit_in_bytes' set to '6442450944' for
'/sys/fs/cgroup/memory/slurm/uid_930/job_4304882'
[2022-04-05T02:43:28.241] [4304882.batch] debug3: cgroup/v1:
common_cgroup_set_uint64_param: common_cgroup_set_uint64_param: parameter
'memory.memsw.limit_in_bytes' set to '6442450944' for
'/sys/fs/cgroup/memory/slurm/uid_930/job_4304882'
[2022-04-05T02:43:28.241] [4304882.batch] task/cgroup: _memcg_initialize: step:
alloc=6144MB mem.limit=6144MB memsw.limit=6144MB
[2022-04-05T02:43:28.241] [4304882.batch] debug3: cgroup/v1:
common_cgroup_set_uint64_param: common_cgroup_set_uint64_param: parameter
'memory.limit_in_bytes' set to '6442450944' for
'/sys/fs/cgroup/memory/slurm/uid_930/job_4304882/step_batch'
[2022-04-05T02:43:28.241] [4304882.batch] debug3: cgroup/v1:
common_cgroup_set_uint64_param: common_cgroup_set_uint64_param: parameter
'memory.soft_limit_in_bytes' set to '6442450944' for
'/sys/fs/cgroup/memory/slurm/uid_930/job_4304882/step_batch'
[2022-04-05T02:43:28.241] [4304882.batch] debug3: cgroup/v1:
common_cgroup_set_uint64_param: common_cgroup_set_uint64_param: parameter
'memory.memsw.limit_in_bytes' set to '6442450944' for
'/sys/fs/cgroup/memory/slurm/uid_930/job_4304882/step_batch'
[2022-04-05T02:43:28.242] [4304882.batch] debug: cgroup/v1:
_oom_event_monitor: started.
[2022-04-05T02:43:28.242] [4304882.batch] debug2: hwloc_topology_load
[2022-04-05T02:43:28.286] [4304882.batch] debug2: hwloc_topology_export_xml
[2022-04-05T02:43:28.293] [4304882.batch] debug3: Called _msg_socket_readable
[2022-04-05T02:43:28.293] [4304882.batch] debug3: Called _msg_socket_readable
[2022-04-05T02:43:28.296] [4304882.batch] debug2: Entering _setup_normal_io
[2022-04-05T02:43:28.296] [4304882.batch] debug3: Called _msg_socket_readable
[2022-04-05T02:43:28.296] [4304882.batch] debug3: Called _msg_socket_readable
[2022-04-05T02:43:28.296] [4304882.batch] debug3: Called _msg_socket_readable
[2022-04-05T02:43:28.309] [4304882.batch] debug3: Called _msg_socket_readable
[2022-04-05T02:43:28.309] [4304882.batch] debug3: Called _msg_socket_readable
[2022-04-05T02:43:28.309] [4304882.batch] debug3: Called _msg_socket_readable
[2022-04-05T02:43:28.309] [4304882.batch] debug2: Leaving _setup_normal_io
[2022-04-05T02:43:28.309] [4304882.batch] debug levels are stderr='error',
logfile='debug3', syslog='quiet'
[2022-04-05T02:43:28.309] [4304882.batch] debug3: Called _msg_socket_readable
[2022-04-05T02:43:28.309] [4304882.batch] debug3: Called _msg_socket_readable
[2022-04-05T02:43:28.309] [4304882.batch] debug3: Called _msg_socket_readable
[2022-04-05T02:43:28.309] [4304882.batch] starting 1 tasks
[2022-04-05T02:43:28.310] [4304882.batch] task 0 (29705) started
2022-04-05T02:43:28
(...)
"scontrol show config" output:
Configuration data as of 2022-04-14T13:57:54
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = associations,limits,qos
AccountingStorageHost = ...
AccountingStorageExternalHost = (null)
AccountingStorageParameters = (null)
AccountingStoragePort = 6819
AccountingStorageTRES =
cpu,mem,energy,node,billing,fs/disk,vmem,pages,gres/gpu,gres/gpu:k80,gres/gpu:v100
AccountingStorageType = accounting_storage/slurmdbd
AccountingStorageUser = N/A
AccountingStoreFlags = job_comment
AcctGatherEnergyType = acct_gather_energy/none
AcctGatherFilesystemType = acct_gather_filesystem/none
AcctGatherInterconnectType = acct_gather_interconnect/none
AcctGatherNodeFreq = 0 sec
AcctGatherProfileType = acct_gather_profile/none
AllowSpecResourcesUsage = No
AuthAltTypes = auth/jwt
AuthAltParameters = jwt_key=/etc/slurm/jwt_hs256.key
AuthInfo = (null)
AuthType = auth/munge
BatchStartTimeout = 10 sec
BcastExclude = /lib,/usr/lib,/lib64,/usr/lib64
BcastParameters = (null)
BOOT_TIME = 2022-04-11T14:26:48
BurstBufferType = (null)
CliFilterPlugins = (null)
ClusterName = ccslurmlocal
CommunicationParameters = (null)
CompleteWait = 0 sec
CoreSpecPlugin = core_spec/none
CpuFreqDef = Unknown
CpuFreqGovernors = OnDemand,Performance,UserSpace
CredType = cred/munge
DebugFlags = CPU_Bind,Gres
DefMemPerNode = UNLIMITED
DependencyParameters = (null)
DisableRootJobs = Yes
EioTimeout = 60
EnforcePartLimits = NO
Epilog = /etc/slurm/epilog.sh
EpilogMsgTime = 2000 usec
EpilogSlurmctld = (null)
ExtSensorsType = ext_sensors/none
ExtSensorsFreq = 0 sec
FairShareDampeningFactor = 1
FederationParameters = (null)
FirstJobId = 1
GetEnvTimeout = 2 sec
GresTypes = gpu
GpuFreqDef = high,memory=high
GroupUpdateForce = 1
GroupUpdateTime = 600 sec
HASH_VAL = Match
HealthCheckInterval = 0 sec
HealthCheckNodeState = ANY
HealthCheckProgram = (null)
InactiveLimit = 0 sec
InteractiveStepOptions = --interactive
JobAcctGatherFrequency = 30
JobAcctGatherType = jobacct_gather/cgroup
JobAcctGatherParams = (null)
JobCompHost = localhost
JobCompLoc = /var/log/slurm_jobcomp.log
JobCompPort = 0
JobCompType = jobcomp/none
JobCompUser = root
JobContainerType = job_container/none
JobCredentialPrivateKey = (null)
JobCredentialPublicCertificate = (null)
JobDefaults = (null)
JobFileAppend = 0
JobRequeue = 1
JobSubmitPlugins = lua
KeepAliveTime = SYSTEM_DEFAULT
KillOnBadExit = 0
KillWait = 30 sec
LaunchParameters = use_interactive_step
LaunchType = launch/slurm
Licenses = ...
LogTimeFormat = iso8601_ms
MailDomain = (null)
MailProg = /bin/mail
MaxArraySize = 1000001
MaxDBDMsgs = 20016
MaxJobCount = 40000
MaxJobId = 67043328
MaxMemPerNode = UNLIMITED
MaxStepCount = 40000
MaxTasksPerNode = 512
MCSPlugin = mcs/none
MCSParameters = (null)
MessageTimeout = 30 sec
MinJobAge = 60 sec
MpiDefault = none
MpiParams = (null)
NEXT_JOB_ID = 5670091
NodeFeaturesPlugins = (null)
OverTimeLimit = 0 min
PluginDir = /usr/lib64/slurm
PlugStackConfig = (null)
PowerParameters = (null)
PowerPlugin =
PreemptMode = OFF
PreemptType = preempt/none
PreemptExemptTime = 00:00:00
PrEpParameters = (null)
PrEpPlugins = prep/script
PriorityParameters = (null)
PrioritySiteFactorParameters = (null)
PrioritySiteFactorPlugin = (null)
PriorityDecayHalfLife = 7-00:00:00
PriorityCalcPeriod = 00:05:00
PriorityFavorSmall = No
PriorityFlags =
PriorityMaxAge = 4-00:00:00
PriorityUsageResetPeriod = NONE
PriorityType = priority/multifactor
PriorityWeightAge = 100
PriorityWeightAssoc = 0
PriorityWeightFairShare = 1000
PriorityWeightJobSize = 0
PriorityWeightPartition = 0
PriorityWeightQOS = 10
PriorityWeightTRES = (null)
PrivateData = accounts,events,jobs,reservations,usage,users
ProctrackType = proctrack/cgroup
Prolog = /etc/slurm/prolog.sh
PrologEpilogTimeout = 65534
PrologSlurmctld = (null)
PrologFlags = Alloc
PropagatePrioProcess = 0
PropagateResourceLimits = NONE
PropagateResourceLimitsExcept = (null)
RebootProgram = (null)
ReconfigFlags = (null)
RequeueExit = (null)
RequeueExitHold = (null)
ResumeFailProgram = (null)
ResumeProgram = (null)
ResumeRate = 300 nodes/min
ResumeTimeout = 60 sec
ResvEpilog = (null)
ResvOverRun = 0 min
ResvProlog = (null)
ReturnToService = 1
RoutePlugin = route/default
SchedulerParameters = pack_serial_at_end,max_rpc_cnt=40
SchedulerTimeSlice = 30 sec
SchedulerType = sched/backfill
ScronParameters = (null)
SelectType = select/cons_tres
SelectTypeParameters = CR_CPU_MEMORY
SlurmUser = slurm(9912)
SlurmctldAddr = (null)
SlurmctldDebug = info
SlurmctldHost[0] = ...01
SlurmctldHost[1] = ...02
SlurmctldLogFile = /var/log/slurm/slurmctld.log
SlurmctldPort = 6817
SlurmctldSyslogDebug = unknown
SlurmctldPrimaryOffProg = (null)
SlurmctldPrimaryOnProg = (null)
SlurmctldTimeout = 120 sec
SlurmctldParameters = (null)
SlurmdDebug = info
SlurmdLogFile = /var/log/slurm/slurmd.log
SlurmdParameters = (null)
SlurmdPidFile = /var/run/slurmd.pid
SlurmdPort = 6818
SlurmdSpoolDir = /var/spool/slurmd
SlurmdSyslogDebug = unknown
SlurmdTimeout = 500 sec
SlurmdUser = root(0)
SlurmSchedLogFile = (null)
SlurmSchedLogLevel = 0
SlurmctldPidFile = /var/run/slurmctld.pid
SlurmctldPlugstack = (null)
SLURM_CONF = /etc/slurm/slurm.conf
SLURM_VERSION = 21.08.6
SrunEpilog = (null)
SrunPortRange = 40001-49999
SrunProlog = (null)
StateSaveLocation = /pbs/slurm/prod21.08.6/var/spool/slurmctld
SuspendExcNodes = (null)
SuspendExcParts = (null)
SuspendProgram = (null)
SuspendRate = 60 nodes/min
SuspendTime = INFINITE
SuspendTimeout = 30 sec
SwitchParameters = (null)
SwitchType = switch/none
TaskEpilog = /etc/slurm/taskepilog.sh
TaskPlugin = task/cgroup,task/affinity
TaskPluginParam = (null type)
TaskProlog = /etc/slurm/taskprolog.sh
TCPTimeout = 6 sec
TmpFS = /tmp
TopologyParam = (null)
TopologyPlugin = topology/none
TrackWCKey = No
TreeWidth = 50
UsePam = Yes
UnkillableStepProgram = (null)
UnkillableStepTimeout = 120 sec
VSizeFactor = 0 percent
WaitTime = 0 sec
X11Parameters = (null)
Cgroup Support Configuration:
AllowedDevicesFile = (null)
AllowedKmemSpace = (null)
AllowedRAMSpace = 100.0%
AllowedSwapSpace = 0.0%
CgroupAutomount = no
CgroupMountpoint = (null)
CgroupPlugin = (null)
ConstrainCores = no
ConstrainDevices = no
ConstrainKmemSpace = no
ConstrainRAMSpace = no
ConstrainSwapSpace = no
MaxKmemPercent = 100.0%
MaxRAMPercent = 100.0%
MaxSwapPercent = 100.0%
MemorySwappiness = (null)
MinKmemSpace = 30 MB
MinRAMSpace = 30 MB
TaskAffinity = no
Slurmctld(primary) at ...01 is UP
Slurmctld(backup) at ...02 is UP