If anyone has advice, I would really appreciate it.
I am running (just installed) Slurm 17.11.6, with a master plus 2 compute hosts. It works
locally on the master (controller + execution). However, I cannot establish
communication between the master [triumph01] and the 2 hosts [triumph02,triumph03].
Here is some more info:
1. munge is running, and munge verification tests all pass.
2. system clocks are in sync on master/hosts.
3. identical slurm.conf files are on master/hosts.
4. configuration of resources (memory/cpus/etc.) is correct and has been
confirmed on all machines (all hardware is identical).
5. I have attached (pasted below):
a) slurm.conf
b) log file from master slurmctld
c) log file from host slurmd
Any ideas about what to try next?
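For reference, the munge and clock checks in items 1 and 2 were done roughly like the
sketch below (hostnames are from our cluster; the wrapped commands are just the standard
munge/unmunge round trip plus a plain date comparison):

#!/usr/bin/env python3
"""Rough sketch of the checks in items 1 and 2 (hostnames are from our cluster)."""
import subprocess

HOSTS = ["triumph02", "triumph03"]

# Item 1: encode a MUNGE credential locally and decode it on each remote host.
# unmunge reports "STATUS: Success (0)" when the keys (and clocks) agree.
for host in HOSTS:
    out = subprocess.run(f"munge -n | ssh {host} unmunge",
                         shell=True, capture_output=True, text=True)
    status = "munge OK" if "Success" in out.stdout else "munge FAILED"
    print(host, status)

# Item 2: compare each host's clock (epoch seconds) against the local one.
local = float(subprocess.check_output(["date", "+%s"], text=True))
for host in HOSTS:
    remote = float(subprocess.check_output(["ssh", host, "date", "+%s"], text=True))
    print(host, f"clock offset {remote - local:+.1f} s")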
Heath Turner
Professor
Graduate Coordinator
Chemical and Biological Engineering
http://che.eng.ua.edu
University of Alabama
3448 SEC, Box 870203
Tuscaloosa, AL 35487
(205) 348-1733 (phone)
(205) 561-7450 (cell)
(205) 348-7558 (fax)
htur...@eng.ua.edu
http://turnerresearchgroup.ua.edu
--- slurmctld log (master, triumph01) ---
[2018-05-21T01:49:33.608] debug: Log file re-opened
[2018-05-21T01:49:33.609] debug: sched: slurmctld starting
[2018-05-21T01:49:33.609] Warning: Core limit is only 0 KB
[2018-05-21T01:49:33.609] slurmctld version 17.11.6 started on cluster cluster
[2018-05-21T01:49:33.610] Munge cryptographic signature plugin loaded
[2018-05-21T01:49:33.610] Consumable Resources (CR) Node Selection plugin
loaded with argument 4
[2018-05-21T01:49:33.610] preempt/none loaded
[2018-05-21T01:49:33.610] debug: Checkpoint plugin loaded: checkpoint/none
[2018-05-21T01:49:33.610] debug: AcctGatherEnergy NONE plugin loaded
[2018-05-21T01:49:33.610] debug: AcctGatherProfile NONE plugin loaded
[2018-05-21T01:49:33.610] debug: AcctGatherInterconnect NONE plugin loaded
[2018-05-21T01:49:33.610] debug: AcctGatherFilesystem NONE plugin loaded
[2018-05-21T01:49:33.610] debug2: No acct_gather.conf file
(/usr/local/etc/acct_gather.conf)
[2018-05-21T01:49:33.610] debug: Job accounting gather NOT_INVOKED plugin
loaded
[2018-05-21T01:49:33.610] ExtSensors NONE plugin loaded
[2018-05-21T01:49:33.610] debug: switch NONE plugin loaded
[2018-05-21T01:49:33.610] debug: power_save module disabled, SuspendTime < 0
[2018-05-21T01:49:33.611] debug: No backup controller to shutdown
[2018-05-21T01:49:33.611] Accounting storage NOT INVOKED plugin loaded
[2018-05-21T01:49:33.611] debug: Reading slurm.conf file:
/usr/local/etc/slurm.conf
[2018-05-21T01:49:33.612] layouts: no layout to initialize
[2018-05-21T01:49:33.612] topology NONE plugin loaded
[2018-05-21T01:49:33.612] debug: No DownNodes
[2018-05-21T01:49:33.613] debug: Log file re-opened
[2018-05-21T01:49:33.613] sched: Backfill scheduler plugin loaded
[2018-05-21T01:49:33.614] route default plugin loaded
[2018-05-21T01:49:33.614] layouts: loading entities/relations information
[2018-05-21T01:49:33.614] debug: layouts: 3/3 nodes in hash table, rc=0
[2018-05-21T01:49:33.614] debug: layouts: loading stage 1
[2018-05-21T01:49:33.614] debug: layouts: loading stage 1.1 (restore state)
[2018-05-21T01:49:33.614] debug: layouts: loading stage 2
[2018-05-21T01:49:33.614] debug: layouts: loading stage 3
[2018-05-21T01:49:33.614] Recovered state of 3 nodes
[2018-05-21T01:49:33.614] Down nodes: triumph[02-03]
[2018-05-21T01:49:33.614] Recovered information about 0 jobs
[2018-05-21T01:49:33.614] cons_res: select_p_node_init
[2018-05-21T01:49:33.615] cons_res: preparing for 1 partitions
[2018-05-21T01:49:33.615] debug2: init_requeue_policy: kill_invalid_depend is
set to 0
[2018-05-21T01:49:33.615] debug: Updating partition uid access list
[2018-05-21T01:49:33.615] Recovered state of 0 reservations
[2018-05-21T01:49:33.615] State of 0 triggers recovered
[2018-05-21T01:49:33.615] _preserve_plugins: backup_controller not specified
[2018-05-21T01:49:33.615] cons_res: select_p_reconfigure
[2018-05-21T01:49:33.615] cons_res: select_p_node_init
[2018-05-21T01:49:33.615] cons_res: preparing for 1 partitions
[2018-05-21T01:49:33.615] Running as primary controller
[2018-05-21T01:49:33.615] debug: No BackupController, not launching heartbeat.
[2018-05-21T01:49:33.615] debug: Priority BASIC plugin loaded
[2018-05-21T01:49:33.615] No parameter for mcs plugin, default values set
[2018-05-21T01:49:33.615] mcs: MCSParameters = (null). ondemand set.
[2018-05-21T01:49:33.615] debug: mcs none plugin loaded
[2018-05-21T01:49:33.616] debug: power_save mode not enabled
[2018-05-21T01:49:33.616] debug2: slurmctld listening on 0.0.0.0:6817
[2018-05-21T01:49:36.619] debug: Spawning registration agent for
triumph[01-03] 3 hosts
[2018-05-21T01:49:36.619] debug2: Spawning RPC agent for msg_type
REQUEST_NODE_REGISTRATION_STATUS
[2018-05-21T01:49:36.619] debug2: got 1 threads to send out
[2018-05-21T01:49:36.619] debug2: Tree head got back 0 looking for 3
[2018-05-21T01:49:36.620] debug: Munge authentication plugin loaded
[2018-05-21T01:49:36.621] debug2: Tree head got back 1
[2018-05-21T01:49:36.621] debug2: Processing RPC:
MESSAGE_NODE_REGISTRATION_STATUS from uid=0
[2018-05-21T01:49:36.621] debug: validate_node_specs: node triumph01
registered with 0 jobs
[2018-05-21T01:49:36.621] debug2: _slurm_rpc_node_registration complete for
triumph01 usec=90
[2018-05-21T01:49:37.620]
SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
[2018-05-21T01:49:37.620] debug: sched: Running job scheduler
[2018-05-21T01:49:38.622] debug2: slurm_connect poll timeout: Connection timed
out
[2018-05-21T01:49:38.622] debug2: Error connecting slurm stream socket at
192.168.0.22:6818: Connection timed out
[2018-05-21T01:49:38.622] debug2: slurm_connect poll timeout: Connection timed
out
[2018-05-21T01:49:38.622] debug2: Error connecting slurm stream socket at
192.168.0.23:6818: Connection timed out
[2018-05-21T01:49:38.622] debug2: Tree head got back 2
[2018-05-21T01:49:38.622] debug2: Tree head got back 2
[2018-05-21T01:49:38.622] debug2: Tree head got back 3
[2018-05-21T01:49:38.622] debug2: Tree head got back 3
[2018-05-21T01:49:38.622] agent/is_node_resp: node:triumph02
RPC:REQUEST_NODE_REGISTRATION_STATUS : Communication connection failure
[2018-05-21T01:49:38.622] agent/is_node_resp: node:triumph03
RPC:REQUEST_NODE_REGISTRATION_STATUS : Communication connection failure
[2018-05-21T01:49:38.895] debug2: node_did_resp triumph01
[2018-05-21T01:49:38.895] debug2: agent maximum delay 2 seconds
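Reading the tail of this log, slurmctld itself comes up fine but times out opening TCP
connections to 192.168.0.22:6818 and 192.168.0.23:6818 (the SlurmdPort on the two hosts).
Before digging further into Slurm, I am planning to confirm raw reachability of those
ports from triumph01 with something like the sketch below (addresses and port taken from
the log above; a timeout rather than "connection refused" would suggest the packets are
being dropped, e.g. by a host firewall):

#!/usr/bin/env python3
"""Probe slurmd's port on the two hosts from triumph01 (addresses/port from the log above)."""
import socket

TARGETS = [("192.168.0.22", 6818), ("192.168.0.23", 6818)]

for host, port in TARGETS:
    try:
        with socket.create_connection((host, port), timeout=5):
            print(f"{host}:{port} reachable")
    except OSError as err:
        # Timeout here usually means traffic is being dropped in transit;
        # "connection refused" would mean nothing is listening on that port.
        print(f"{host}:{port} NOT reachable: {err}")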
--- slurm.conf ---
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=triumph01
ControlAddr=triumph01
#BackupController=
#BackupAddr=
#
AuthType=auth/munge
#CheckpointType=checkpoint/none
CryptoType=crypto/munge
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=999999
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobCheckpointDir=/var/slurm/checkpoint
#JobCredentialPrivateKey=/usr/local/bin/crypto
#JobCredentialPublicCertificate=/home/crypto
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=1
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=5000
#MaxStepCount=40000
#MaxTasksPerNode=128
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/pgid
#Prolog=
#PrologFlags=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=root
SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/spool
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/affinity
TaskPluginParam=Sched
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
FastSchedule=1
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core
#
#
# JOB PRIORITY
#PriorityFlags=
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
#AccountingStorageUser=
AccountingStoreJobComment=YES
ClusterName=cluster
#DebugFlags=
#JobCompHost=
#JobCompLoc=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
#JobContainerType=job_container/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
SlurmctldLogFile=/usr/local/sbin/slurmctld-log
SlurmdDebug=3
SlurmdLogFile=/usr/local/sbin/slurmd-log
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=DEFAULT CPUs=64 RealMemory=256000 Sockets=4 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
NodeName=triumph[01-03]
PartitionName=DEFAULT State=Up
PartitionName=debug Nodes=triumph[01-03] Default=YES MaxTime=INFINITE
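One thing I am unsure about: slurm.conf has no explicit NodeAddr entries, so (as I
understand it) each daemon resolves the plain NodeName hostnames, and the logs show them
resolving to 192.168.0.21-23. To make sure resolution is consistent everywhere, I can run
a quick sketch like this on each of the three machines (names taken from the conf above):

#!/usr/bin/env python3
"""Print what the NodeName hostnames from slurm.conf resolve to on this machine.
Run on triumph01, triumph02 and triumph03; all three should print the same addresses."""
import socket

for name in ["triumph01", "triumph02", "triumph03"]:
    try:
        addr = socket.gethostbyname(name)
    except socket.gaierror as err:
        addr = f"UNRESOLVABLE ({err})"
    print(f"{name:10s} -> {addr}")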
--- slurmd log (compute host) ---
[2018-05-21T01:44:03.446] debug: Log file re-opened
[2018-05-21T01:44:03.448] Message aggregation disabled
[2018-05-21T01:44:03.448] topology NONE plugin loaded
[2018-05-21T01:44:03.448] route default plugin loaded
[2018-05-21T01:44:03.449] CPU frequency setting not configured for this node
[2018-05-21T01:44:03.449] debug: Resource spec: No specialized cores
configured by default on this node
[2018-05-21T01:44:03.449] debug: Resource spec: Reserved system memory limit
not configured for this node
[2018-05-21T01:44:03.449] task affinity plugin loaded with CPU mask
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000ffffffffffffffff
[2018-05-21T01:44:03.450] debug: Munge authentication plugin loaded
[2018-05-21T01:44:03.450] debug: spank: opening plugin stack
/usr/local/etc/plugstack.conf
[2018-05-21T01:44:03.450] Munge cryptographic signature plugin loaded
[2018-05-21T01:44:03.451] Warning: Core limit is only 0 KB
[2018-05-21T01:44:03.451] slurmd version 17.11.6 started
[2018-05-21T01:44:03.452] debug: Job accounting gather NOT_INVOKED plugin
loaded
[2018-05-21T01:44:03.452] debug: job_container none plugin loaded
[2018-05-21T01:44:03.452] debug: switch NONE plugin loaded
[2018-05-21T01:44:03.452] slurmd started on Mon, 21 May 2018 01:44:03 -0500
[2018-05-21T01:44:03.452] CPUs=64 Boards=1 Sockets=4 Cores=8 Threads=2
Memory=258447 TmpDisk=204676 Uptime=43546739 CPUSpecList=(null)
FeaturesAvail=(null) FeaturesActive=(null)
[2018-05-21T01:44:03.453] debug: AcctGatherEnergy NONE plugin loaded
[2018-05-21T01:44:03.453] debug: AcctGatherProfile NONE plugin loaded
[2018-05-21T01:44:03.453] debug: AcctGatherInterconnect NONE plugin loaded
[2018-05-21T01:44:03.453] debug: AcctGatherFilesystem NONE plugin loaded
[2018-05-21T01:44:03.453] debug2: Reading acct_gather.conf file
/usr/local/etc/acct_gather.conf
[2018-05-21T01:44:05.455] debug2: slurm_connect poll timeout: Connection timed
out
[2018-05-21T01:44:05.455] debug2: Error connecting slurm stream socket at
192.168.0.21:6817: Connection timed out
[2018-05-21T01:44:05.455] debug: Failed to contact primary controller:
Connection timed out
[2018-05-21T01:44:08.457] debug2: slurm_connect poll timeout: Connection timed
out
[2018-05-21T01:44:08.457] debug2: Error connecting slurm stream socket at
192.168.0.21:6817: Connection timed out
[2018-05-21T01:44:08.457] debug: Failed to contact primary controller:
Connection timed out
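This is the mirror image of the controller log: slurmd starts cleanly but times out
reaching slurmctld at 192.168.0.21:6817, even though the controller log shows slurmctld
listening on 0.0.0.0:6817. From a compute host I can try to distinguish "nothing
listening" from "blocked in transit" with a sketch like this (ports are the default
SlurmdPort/SlurmctldPort from the conf; if the local check succeeds but the controller
probe times out, my guess is a host firewall such as iptables/firewalld on one end):

#!/usr/bin/env python3
"""Run on a compute host (e.g. triumph02): check the local slurmd port,
then probe slurmctld on the controller (address from the log above)."""
import socket

def probe(host, port):
    try:
        with socket.create_connection((host, port), timeout=5):
            return "reachable"
    except OSError as err:
        return f"NOT reachable: {err}"

# slurmd should be listening locally on SlurmdPort (6818 by default).
print("local slurmd   127.0.0.1:6818    ->", probe("127.0.0.1", 6818))
# slurmctld on triumph01 should be reachable on SlurmctldPort (6817 by default).
print("controller     192.168.0.21:6817 ->", probe("192.168.0.21", 6817))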