Never mind, I found the problem. The rebuilt nodes were still listed in my
other cluster's config (running Slurm 19), so that cluster's controller was
still sending them status checks that they couldn't respond to. Tidied up the
config and the messages have disappeared.
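
For anyone who hits the same errors later: the stale entries were just ordinary
node/partition definitions left over in the Slurm 19 cluster's slurm.conf,
something of this shape (the hostnames below are placeholders for illustration,
not my real config):

NodeName=node[01-02] CPUs=16 RealMemory=64000 State=UNKNOWN
PartitionName=main Nodes=node[01-02] Default=YES MaxTime=INFINITE State=UP

Removing the rebuilt hosts from lines like those (and restarting that old
slurmctld so it picks up the change) is all the tidy-up involves.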

From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Mark 
Holliman
Sent: 29 November 2022 11:53
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: [slurm-users] protocol_version 8960 not supported

Hello,

I've just finished building and installing Slurm 22.05.6 from source on a head
node and a couple of worker nodes. I installed the same RPMs on all the nodes,
and the slurmdbd, slurmctld, and slurmd daemons have all come online and appear
healthy (test jobs can be submitted to partitions and run successfully on the
nodes). But I'm seeing these errors at regular intervals in the Slurm logs:

[2022-11-29T11:29:49.683] error: unpack_header: protocol_version 8960 not supported
[2022-11-29T11:29:49.683] error: unpacking header
[2022-11-29T11:29:49.683] error: destroy_forward: no init
[2022-11-29T11:29:49.684] error: slurm_receive_msg_and_forward: [[sdc-uk]:53026] failed: Message receive failure
[2022-11-29T11:29:49.694] error: service_connection: slurm_receive_msg: Message receive failure
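
(Since the same RPMs went on every node a mixed install seems unlikely, but it
is quick to rule out; these standard commands each print the Slurm version:

sinfo --version      # on the head node
slurmd -V            # on each worker

If those all agree on 22.05.6, the unsupported protocol number must be coming
from an older Slurm component somewhere else, such as an old client binary or
another cluster's daemons, which is what it turned out to be here.)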

My slurm.conf is based on my previous (still existing) cluster's config, and I've
already encountered one or two issues with plugins not working. I can't find
anything online listing the Slurm protocol_version numbers, so I can't check what
is causing this error, though I'm assuming it's plugin related (slurmdbd maybe?).
Turning up the debugging level in the Slurm logs hasn't helped me find the issue.
Does anyone here know what protocol_version 8960 relates to?
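
In case it helps anyone searching the archives later: the protocol_version in
that error is not a plugin identifier but the Slurm wire-protocol version
stamped into every RPC header. As far as I understand the encoding (worth
double-checking against src/common/slurm_protocol_common.h in the source tree),
it is simply the release's protocol number shifted left by 8 bits:

8960 = 0x2300  ->  8960 >> 8 = 35

Each major Slurm release bumps that number by one, and a daemon only accepts
messages from its own release and the two before it, so a 22.05 daemon will
reject anything this old with exactly the "unpack_header: protocol_version ...
not supported" error above. In other words the message is coming from a much
older Slurm component, not from a misconfigured plugin.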

Relevant slurm.conf lines are:

MpiDefault=none
ProctrackType=proctrack/pgid
ReturnToService=2
SlurmUser=slurm
StateSaveLocation=/var/spool/slurm/slurmctld
SwitchType=switch/none
TaskPlugin=task/affinity,task/cgroup
# Job cleanup
Epilog=/etc/slurm/slurm.epilog.clean
UnkillableStepTimeout=120
UnkillableStepProgram=/root/unkillableJobStepScript.sh
# SCHEDULING
#FastSchedule=0
SchedulerType=sched/backfill
SchedulerParameters=nohold_on_prolog_fail
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
PriorityType=priority/multifactor
PriorityWeightPartition=1000
PreemptMode=SUSPEND,GANG
PreemptType=preempt/partition_prio
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/slurmdbd
JobCompType=jobcomp/none
JobAcctGatherFrequency=40
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=5
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=5
SlurmdLogFile=/var/log/slurm/slurmd.log


Cheers,
  Mark

-------------------------------
Mark Holliman
Senior Data Systems Specialist
Wide Field Astronomy Unit
Institute for Astronomy
University of Edinburgh
--------------------------------
The University of Edinburgh is a charitable body, registered in Scotland, with 
registration number SC005336.

