I've got what looks like a race condition followed by an infinite loop in one of the slurmctld threads (slurmctld_sched). It appears to hinge on the SelectTypeParameters option CR_ONE_TASK_PER_CORE on an IBM POWER8 little-endian machine with 8 hardware threads per core.

I've attached my slurm.conf for reference.

During the lockup, one of the slurmctld threads goes to 100% CPU utilization, all communication with slurmctld hangs, and slurmctld starts spawning extra threads every few seconds, seemingly without limit.
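
For what it's worth, the thread growth is easy to watch from outside; something along these lines (purely illustrative) shows the count climbing every few seconds:

    # print slurmctld's thread count every 5 seconds
    watch -n 5 'ps -o nlwp= -p $(pidof slurmctld)'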

I can't see anything in the logs, even when running slurmctld in the foreground with five -v's.
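
For reference, the foreground invocation was roughly:

    slurmctld -D -vvvvv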

I have seen this in 14.11.6 as well as 14.11.7.

The cluster is ppc64le running Ubuntu 14.04; all nodes, including the management node, have two 24-core, 8-way SMT IBM POWER8 CPUs.

The behavior shows up when we run a series of jobs (in the neighborhood of 15-20) with dependencies.
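
A stripped-down sketch of the submission pattern (the script names are just placeholders) looks roughly like this, each step depending on the previous one:

    # submit a chain of ~20 jobs, each dependent on the one before it
    jid=$(sbatch step1.sbatch | awk '{print $4}')
    for i in $(seq 2 20); do
        jid=$(sbatch --dependency=afterok:$jid step$i.sbatch | awk '{print $4}')
    done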

--
Chandler Wilkerson
Center for Research Computing
Rice University
# Author: CHW, KKT
# Date: 2015-05-04
#
ControlMachine=poman
#
AuthType=auth/munge
CacheGroups=0
CryptoType=crypto/munge
DisableRootJobs=Yes
EnforcePartLimits=YES
#MaxJobId=9999999
MpiDefault=pmi2
#PluginDir=/usr/local/lib/slurm
#ProctrackType=proctrack/pgid
##Proctracktype=proctrack/linuxproc
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6816-6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurm
SlurmUser=slurm
StateSaveLocation=/var/spool/slurm
SwitchType=switch/none
##TaskPlugin=task/none
TaskPlugin=task/cgroup
##TaskPlugin=task/affinity
##TaskPluginParam=Sched
UsePam=1
#Epilog=/opt/apps/slurm/scripts/slurm.epilog.clean
PropagateResourceLimitsExcept=MEMLOCK
MailProg=/usr/bin/mail

#
#
# TIMERS
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
#
#
# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
SchedulerPort=7321
#SchedulerParameters=kill_invalid_depend,max_depend_depth=3,bf_continue,bf_max_job_test=20,default_queue_depth=20
SelectType=select/cons_res
#SelectTypeParameters=CR_Core_Memory
SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE
#SelectType=select/linear
#SelectTypeParameters=CR_ONE_TASK_PER_CORE,CR_Memory

#
#
# JOB PRIORITY
# Prioritize emphasizing on fairshare use and job size
#

PriorityType=priority/multifactor
PriorityDecayHalfLife=14-0
PriorityFavorSmall=No
PriorityMaxAge=14-0
PriorityWeightAge=200000
PriorityWeightFairshare=500000
PriorityWeightJobSize=300000
PriorityWeightPartition=0
PriorityWeightQOS=1000000
FairShareDampeningFactor=100
#
#
# LOGGING AND ACCOUNTING
AccountingStorageEnforce=associations,limits,qos
AccountingStorageHost=slurmdb.rcsg.rice.edu
#AccountingStorageHost=poman
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageUser=slurm
AccountingStoreJobComment=NO
ClusterName=po
JobCompHost=poman
JobCompLoc=/var/spool/slurm/job_completions
JobCompType=jobcomp/filetxt
JobCompUser=slurm
JobAcctGatherFrequency=30
#JobAcctGatherType=jobacct_gather/linux
JobAcctGatherType=jobacct_gather/cgroup
SlurmctldDebug=3
SlurmctldLogFile=/var/spool/slurm/logs/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/spool/slurm/logs/slurmd.log
SlurmSchedLogFile=/var/spool/slurm/logs/slurmsched.log 
SlurmSchedLogLevel=3 

#
#
# Health Checks (NHC)
HealthCheckInterval=600
HealthCheckProgram=/opt/apps/system/warewulf/sbin/nhc

#
#
# Preemption policy
#PreemptType=preempt/partition_prio
#PreemptMode=requeue

#
# COMPUTE NODES
GresTypes=gpu
NodeName=po[1-3] Weight=1 Sockets=4 CoresPerSocket=6 ThreadsPerCore=8 State=UNKNOWN RealMemory=261356
NodeName=po[4-5] Weight=3 Sockets=4 CoresPerSocket=6 ThreadsPerCore=8 State=UNKNOWN RealMemory=1046347
NodeName=po[6-7] Weight=2 Sockets=4 CoresPerSocket=6 ThreadsPerCore=8 State=UNKNOWN RealMemory=261356 Gres=gpu:kepler:2
#NodeName=po5 Weight=2 State=UNKNOWN RealMemory=1046347
#NodeName=po[6-7] Weight=3 State=UNKNOWN Gres=gpu:kepler:2 RealMemory=261356

#
# PARTITIONS/QUEUES
PartitionName=commons Nodes=po[1-7] DefaultTime=0 MaxTime=24:00:00 PreemptMode=off Shared=Yes Default=Yes DefMemPerCPU=1024
#PartitionName=commons Nodes=po[5-7] DefaultTime=0 MaxTime=24:00:00 DefMemPerCPU=1024 MaxMemPerCPU=43960 MinNodes=2 State=UP PreemptMode=off Shared=Yes Default=Yes AllocNodes=poman,po2
#PartitionName=interactive Nodes=po[1-3] DefaultTime=0 MaxTime=00:30:00 DefMemPerCPU=1024 MaxMemPerCPU=43960 State=UP PreemptMode=off Shared=Yes Default=No AllocNodes=poman,po2
