If anyone has advice, I would really appreciate it.
I am running (just installed) Slurm 17.11.6, with a master plus 2 compute hosts. It works
locally on the master (controller + execution). However, I cannot establish
communication between the master [triumph01] and the 2 hosts [triumph02,triumph03].
Here is some more info:
1. munge is running, and munge verification tests all pass.
2. system clocks are in sync on master/hosts.
3. identical slurm.conf files are on master/hosts.
4. configuration of resources (memory/cpus/etc.) is correct and has been
confirmed on all machines (all hardware is identical).
5. I have attached (pasted below):
a) slurm.conf
b) log file from master slurmctld
c) log file from host slurmd
Any ideas about what to try next?
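For reference, the munge and clock checks in items 1 and 2 were done roughly like the
sketch below (hostnames are from our cluster; the wrapped commands are just the standard
munge/unmunge round trip plus a plain date comparison):

#!/usr/bin/env python3
"""Rough sketch of the checks in items 1 and 2 (hostnames are from our cluster)."""
import subprocess

HOSTS = ["triumph02", "triumph03"]

# Item 1: encode a MUNGE credential locally and decode it on each remote host.
# unmunge reports "STATUS: Success (0)" when the keys (and clocks) agree.
for host in HOSTS:
    out = subprocess.run(f"munge -n | ssh {host} unmunge",
                         shell=True, capture_output=True, text=True)
    status = "munge OK" if "Success" in out.stdout else "munge FAILED"
    print(host, status)

# Item 2: compare each host's clock (epoch seconds) against the local one.
local = float(subprocess.check_output(["date", "+%s"], text=True))
for host in HOSTS:
    remote = float(subprocess.check_output(["ssh", host, "date", "+%s"], text=True))
    print(host, f"clock offset {remote - local:+.1f} s")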
Heath Turner
Professor
Graduate Coordinator
Chemical and Biological Engineering
http://che.eng.ua.edu
University of Alabama
3448 SEC, Box 870203
Tuscaloosa, AL 35487
(205) 348-1733 (phone)
(205) 561-7450 (cell)
(205) 348-7558 (fax)
htur...@eng.ua.edu
http://turnerresearchgroup.ua.edu
--- slurmctld log (master, triumph01) ---
[2018-05-21T01:49:33.608] debug: Log file re-opened
[2018-05-21T01:49:33.609] debug: sched: slurmctld starting
[2018-05-21T01:49:33.609] Warning: Core limit is only 0 KB
[2018-05-21T01:49:33.609] slurmctld version 17.11.6 started on cluster cluster
[2018-05-21T01:49:33.610] Munge cryptographic signature plugin loaded
[2018-05-21T01:49:33.610] Consumable Resources (CR) Node Selection plugin
loaded with argument 4
[2018-05-21T01:49:33.610] preempt/none loaded
[2018-05-21T01:49:33.610] debug: Checkpoint plugin loaded: checkpoint/none
[2018-05-21T01:49:33.610] debug: AcctGatherEnergy NONE plugin loaded
[2018-05-21T01:49:33.610] debug: AcctGatherProfile NONE plugin loaded
[2018-05-21T01:49:33.610] debug: AcctGatherInterconnect NONE plugin loaded
[2018-05-21T01:49:33.610] debug: AcctGatherFilesystem NONE plugin loaded
[2018-05-21T01:49:33.610] debug2: No acct_gather.conf file
(/usr/local/etc/acct_gather.conf)
[2018-05-21T01:49:33.610] debug: Job accounting gather NOT_INVOKED plugin
loaded
[2018-05-21T01:49:33.610] ExtSensors NONE plugin loaded
[2018-05-21T01:49:33.610] debug: switch NONE plugin loaded
[2018-05-21T01:49:33.610] debug: power_save module disabled, SuspendTime < 0
[2018-05-21T01:49:33.611] debug: No backup controller to shutdown
[2018-05-21T01:49:33.611] Accounting storage NOT INVOKED plugin loaded
[2018-05-21T01:49:33.611] debug: Reading slurm.conf file:
/usr/local/etc/slurm.conf
[2018-05-21T01:49:33.612] layouts: no layout to initialize
[2018-05-21T01:49:33.612] topology NONE plugin loaded
[2018-05-21T01:49:33.612] debug: No DownNodes
[2018-05-21T01:49:33.613] debug: Log file re-opened
[2018-05-21T01:49:33.613] sched: Backfill scheduler plugin loaded
[2018-05-21T01:49:33.614] route default plugin loaded
[2018-05-21T01:49:33.614] layouts: loading entities/relations information
[2018-05-21T01:49:33.614] debug: layouts: 3/3 nodes in hash table, rc=0
[2018-05-21T01:49:33.614] debug: layouts: loading stage 1
[2018-05-21T01:49:33.614] debug: layouts: loading stage 1.1 (restore state)
[2018-05-21T01:49:33.614] debug: layouts: loading stage 2
[2018-05-21T01:49:33.614] debug: layouts: loading stage 3
[2018-05-21T01:49:33.614] Recovered state of 3 nodes
[2018-05-21T01:49:33.614] Down nodes: triumph[02-03]
[2018-05-21T01:49:33.614] Recovered information about 0 jobs
[2018-05-21T01:49:33.614] cons_res: select_p_node_init
[2018-05-21T01:49:33.615] cons_res: preparing for 1 partitions
[2018-05-21T01:49:33.615] debug2: init_requeue_policy: kill_invalid_depend is
set to 0
[2018-05-21T01:49:33.615] debug: Updating partition uid access list
[2018-05-21T01:49:33.615] Recovered state of 0 reservations
[2018-05-21T01:49:33.615] State of 0 triggers recovered
[2018-05-21T01:49:33.615] _preserve_plugins: backup_controller not specified
[2018-05-21T01:49:33.615] cons_res: select_p_reconfigure
[2018-05-21T01:49:33.615] cons_res: select_p_node_init
[2018-05-21T01:49:33.615] cons_res: preparing for 1 partitions
[2018-05-21T01:49:33.615] Running as primary controller
[2018-05-21T01:49:33.615] debug: No BackupController, not launching heartbeat.
[2018-05-21T01:49:33.615] debug: Priority BASIC plugin loaded
[2018-05-21T01:49:33.615] No parameter for mcs plugin, default values set
[2018-05-21T01:49:33.615] mcs: MCSParameters = (null). ondemand set.
[2018-05-21T01:49:33.615] debug: mcs none plugin loaded
[2018-05-21T01:49:33.616] debug: power_save mode not enabled
[2018-05-21T01:49:33.616] debug2: slurmctld listening on 0.0.0.0:6817
[2018-05-21T01:49:36.619] debug: Spawning registration agent for
triumph[01-03] 3 hosts
[2018-05-21T01:49:36.619] debug2: Spawning RPC agent for msg_type
REQUEST_NODE_REGISTRATION_STATUS
[2018-05-21T01:49:36.619] debug2: got 1 threads to send out
[2018-05-21T01:49:36.619] debug2: Tree head got back 0 looking for 3
[2018-05-21T01:49:36.620] debug: Munge authentication plugin loaded
[2018-05-21T01:49:36.621] debug2: Tree head got back 1
[2018-05-21T01:49:36.621] debug2: Processing RPC:
MESSAGE_NODE_REGISTRATION_STATUS from uid=0
[2018-05-21T01:49:36.621] debug: validate_node_specs: node triumph01
registered with 0 jobs
[2018-05-21T01:49:36.621] debug2: _slurm_rpc_node_registration complete for
triumph01 usec=90
[2018-05-21T01:49:37.620]
SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
[2018-05-21T01:49:37.620] debug: sched: Running job scheduler
[2018-05-21T01:49:38.622] debug2: slurm_connect poll timeout: Connection timed
out
[2018-05-21T01:49:38.622] debug2: Error connecting slurm stream socket at
192.168.0.22:6818: Connection timed out
[2018-05-21T01:49:38.622] debug2: slurm_connect poll timeout: Connection timed
out
[2018-05-21T01:49:38.622] debug2: Error connecting slurm stream socket at
192.168.0.23:6818: Connection timed out
[2018-05-21T01:49:38.622] debug2: Tree head got back 2
[2018-05-21T01:49:38.622] debug2: Tree head got back 2
[2018-05-21T01:49:38.622] debug2: Tree head got back 3
[2018-05-21T01:49:38.622] debug2: Tree head got back 3
[2018-05-21T01:49:38.622] agent/is_node_resp: node:triumph02
RPC:REQUEST_NODE_REGISTRATION_STATUS : Communication connection failure
[2018-05-21T01:49:38.622] agent/is_node_resp: node:triumph03
RPC:REQUEST_NODE_REGISTRATION_STATUS : Communication connection failure
[2018-05-21T01:49:38.895] debug2: node_did_resp triumph01
[2018-05-21T01:49:38.895] debug2: agent maximum delay 2 seconds
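Reading the tail of this log, slurmctld itself comes up fine but times out opening TCP
connections to 192.168.0.22:6818 and 192.168.0.23:6818 (the SlurmdPort on the two hosts).
Before digging further into Slurm, I am planning to confirm raw reachability of those
ports from triumph01 with something like the sketch below (addresses and port taken from
the log above; a timeout rather than "connection refused" would suggest the packets are
being dropped, e.g. by a host firewall):

#!/usr/bin/env python3
"""Probe slurmd's port on the two hosts from triumph01 (addresses/port from the log above)."""
import socket

TARGETS = [("192.168.0.22", 6818), ("192.168.0.23", 6818)]

for host, port in TARGETS:
    try:
        with socket.create_connection((host, port), timeout=5):
            print(f"{host}:{port} reachable")
    except OSError as err:
        # Timeout here usually means traffic is being dropped in transit;
        # "connection refused" would mean nothing is listening on that port.
        print(f"{host}:{port} NOT reachable: {err}")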
--- slurm.conf ---
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=triumph01
ControlAddr=triumph01
#BackupController=
#BackupAddr=
#
AuthType=auth/munge
#CheckpointType=checkpoint/none
CryptoType=crypto/munge
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=999999
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobCheckpointDir=/var/slurm/checkpoint
#JobCredentialPrivateKey=/usr/local/bin/crypto
#JobCredentialPublicCertificate=/home/crypto
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=1
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=5000
#MaxStepCount=40000
#MaxTasksPerNode=128
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/pgid
#Prolog=
#PrologFlags=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=root
SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/spool
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/affinity
TaskPluginParam=Sched
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
FastSchedule=1
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core
#
#
# JOB PRIORITY
#PriorityFlags=
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
#AccountingStorageUser=
AccountingStoreJobComment=YES
ClusterName=cluster
#DebugFlags=
#JobCompHost=
#JobCompLoc=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
#JobContainerType=job_container/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
SlurmctldLogFile=/usr/local/sbin/slurmctld-log
SlurmdDebug=3
SlurmdLogFile=/usr/local/sbin/slurmd-log
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=DEFAULT CPUs=64 RealMemory=256000 Sockets=4 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
NodeName=triumph[01-03]
PartitionName=DEFAULT State=Up
PartitionName=debug Nodes=triumph[01-03] Default=YES MaxTime=INFINITE
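One thing I am unsure about: slurm.conf has no explicit NodeAddr entries, so (as I
understand it) each daemon resolves the plain NodeName hostnames, and the logs show them
resolving to 192.168.0.21-23. To make sure resolution is consistent everywhere, I can run
a quick sketch like this on each of the three machines (names taken from the conf above):

#!/usr/bin/env python3
"""Print what the NodeName hostnames from slurm.conf resolve to on this machine.
Run on triumph01, triumph02 and triumph03; all three should print the same addresses."""
import socket

for name in ["triumph01", "triumph02", "triumph03"]:
    try:
        addr = socket.gethostbyname(name)
    except socket.gaierror as err:
        addr = f"UNRESOLVABLE ({err})"
    print(f"{name:10s} -> {addr}")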
--- slurmd log (compute host) ---
[2018-05-21T01:44:03.446] debug: Log file re-opened
[2018-05-21T01:44:03.448] Message aggregation disabled
[2018-05-21T01:44:03.448] topology NONE plugin loaded
[2018-05-21T01:44:03.448] route default plugin loaded
[2018-05-21T01:44:03.449] CPU frequency setting not configured for this node
[2018-05-21T01:44:03.449] debug: Resource spec: No specialized cores
configured by default on this node
[2018-05-21T01:44:03.449] debug: Resource spec: Reserved system memory limit
not configured for this node
[2018-05-21T01:44:03.449] task affinity plugin loaded with CPU mask
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000ffffffffffffffff
[2018-05-21T01:44:03.450] debug: Munge authentication plugin loaded
[2018-05-21T01:44:03.450] debug: spank: opening plugin stack
/usr/local/etc/plugstack.conf
[2018-05-21T01:44:03.450] Munge cryptographic signature plugin loaded
[2018-05-21T01:44:03.451] Warning: Core limit is only 0 KB
[2018-05-21T01:44:03.451] slurmd version 17.11.6 started
[2018-05-21T01:44:03.452] debug: Job accounting gather NOT_INVOKED plugin
loaded
[2018-05-21T01:44:03.452] debug: job_container none plugin loaded
[2018-05-21T01:44:03.452] debug: switch NONE plugin loaded
[2018-05-21T01:44:03.452] slurmd started on Mon, 21 May 2018 01:44:03 -0500
[2018-05-21T01:44:03.452] CPUs=64 Boards=1 Sockets=4 Cores=8 Threads=2
Memory=258447 TmpDisk=204676 Uptime=43546739 CPUSpecList=(null)
FeaturesAvail=(null) FeaturesActive=(null)
[2018-05-21T01:44:03.453] debug: AcctGatherEnergy NONE plugin loaded
[2018-05-21T01:44:03.453] debug: AcctGatherProfile NONE plugin loaded
[2018-05-21T01:44:03.453] debug: AcctGatherInterconnect NONE plugin loaded
[2018-05-21T01:44:03.453] debug: AcctGatherFilesystem NONE plugin loaded
[2018-05-21T01:44:03.453] debug2: Reading acct_gather.conf file
/usr/local/etc/acct_gather.conf
[2018-05-21T01:44:05.455] debug2: slurm_connect poll timeout: Connection timed
out
[2018-05-21T01:44:05.455] debug2: Error connecting slurm stream socket at
192.168.0.21:6817: Connection timed out
[2018-05-21T01:44:05.455] debug: Failed to contact primary controller:
Connection timed out
[2018-05-21T01:44:08.457] debug2: slurm_connect poll timeout: Connection timed
out
[2018-05-21T01:44:08.457] debug2: Error connecting slurm stream socket at
192.168.0.21:6817: Connection timed out
[2018-05-21T01:44:08.457] debug: Failed to contact primary controller:
Connection timed out
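This is the mirror image of the controller log: slurmd starts cleanly but times out
reaching slurmctld at 192.168.0.21:6817, even though the controller log shows slurmctld
listening on 0.0.0.0:6817. From a compute host I can try to distinguish "nothing
listening" from "blocked in transit" with a sketch like this (ports are the default
SlurmdPort/SlurmctldPort from the conf; if the local check succeeds but the controller
probe times out, my guess is a host firewall such as iptables/firewalld on one end):

#!/usr/bin/env python3
"""Run on a compute host (e.g. triumph02): check the local slurmd port,
then probe slurmctld on the controller (address from the log above)."""
import socket

def probe(host, port):
    try:
        with socket.create_connection((host, port), timeout=5):
            return "reachable"
    except OSError as err:
        return f"NOT reachable: {err}"

# slurmd should be listening locally on SlurmdPort (6818 by default).
print("local slurmd   127.0.0.1:6818    ->", probe("127.0.0.1", 6818))
# slurmctld on triumph01 should be reachable on SlurmctldPort (6817 by default).
print("controller     192.168.0.21:6817 ->", probe("192.168.0.21", 6817))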