The uid and gid are the same for the slurm and munge users on each node.  The 
two new nodes, one of which can’t connect with the controller, have the same 
users and were created with the same sequence of steps.  The only exception is 
that the node that won’t connect has the software stack to compile slurm 
installed on it.  I’ll try removing these packages and see if that makes any 
difference.

 

I was wrong about the nodes not having ntp.  They are all running 
systemd-timesyncd.
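Since munge only accepts credentials whose encode time falls inside its TTL window, a quick way to rule out clock skew is to compare epoch seconds between nodes. A minimal sketch, assuming ssh access; "node01" is a placeholder hostname:

```shell
#!/bin/sh
# Sketch: measure clock drift between nodes in whole seconds.
# "node01" is a placeholder hostname (assumption); the ssh line is
# commented out so the helper can be sanity-checked locally first.

abs_drift() {
    # Print the absolute difference, in seconds, of two epoch timestamps.
    d=$(( $1 - $2 ))
    echo "${d#-}"
}

# Cross-node check (uncomment and adjust the hostname):
# abs_drift "$(date +%s)" "$(ssh node01 date +%s)"

abs_drift "$(date +%s)" "$(date +%s)"   # local sanity check
```

More than a few seconds of drift is worth fixing before digging further into munge itself.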

 

I’ve found something interesting and inconsistent on the nodes that I’ll post 
in a new thread since this one is going nowhere.

 

From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Brian 
Andrus
Sent: Sunday, April 19, 2020 9:30 AM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] Munge decode failing on new node

 

I see two things you should probably do:

1. Run ntpd on your nodes. You can even have them sync with your master.
2. Sync your user data on the nodes too, even if that just means ensuring
/etc/passwd and /etc/group are identical on all of them.

While ntp is not required for slurm, the time sync is very important and ntp 
makes that a non-issue. Best practices and all.

#2 is something that is often overlooked, but obvious when you think about it.
I have seen folks add users by doing 'useradd' on each node, but that messes
everything up if installing a package or the like changed the next available
uid on any node.

The error below looks like you may have a different uid for the slurm user on 
the node. What uid is slurmd running as on the bad node vs a good node?
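One way to answer that on each node is a short sketch using getent, which resolves accounts from /etc/passwd or a directory service alike; run it on a good node and the bad node and compare:

```shell
#!/bin/sh
# Sketch: print "user -> uid:gid" for the slurm and munge accounts.
# getent resolves the account whether it lives in /etc/passwd or in a
# directory service; an absent account prints an empty value.

uid_gid() {
    getent passwd "$1" | cut -d: -f3,4
}

for u in slurm munge; do
    printf '%s -> %s\n' "$u" "$(uid_gid "$u")"
done
```

Running the same loop on every node (via ssh or pdsh) makes a mismatched uid stand out immediately.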

Brian Andrus

 

On 4/17/2020 2:38 PM, Dean Schulze wrote:

Just noticed this.  On the problem node the munged.log file has an entry every
1 minute 40 seconds:

 

2020-04-17 15:31:02 -0600 Info:      Invalid credential
2020-04-17 15:32:42 -0600 Info:      Invalid credential
2020-04-17 15:34:22 -0600 Info:      Invalid credential

 

This happens on the failed node and on two other nodes that work.  Two other
working nodes (including the controller) don't have this message.


On Fri, Apr 17, 2020 at 2:00 PM Riebs, Andy <andy.ri...@hpe.com> wrote:

A couple of quick checks to see if the problem is munge:

1. On the problem node, try
$ echo foo | munge | unmunge

2. If (1) works, try this from the node running slurmctld to the problem node:
slurm-node$ echo foo | ssh node munge | unmunge

 

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of
Dean Schulze
Sent: Friday, April 17, 2020 3:40 PM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] Munge decode failing on new node

 

There is no ntp service running on any of my nodes, and all but this one are
working.  I haven't heard that ntp is a requirement for slurm, just that the
time be synchronized across the cluster.  And it is.

 

On Wed, Apr 15, 2020 at 12:17 PM Carlos Fenoy <mini...@gmail.com> wrote:

I’d check ntp, as your encoding time seems odd to me

 

On Wed, 15 Apr 2020 at 19:59, Dean Schulze <dean.w.schu...@gmail.com> wrote:

I've installed two new nodes onto my slurm cluster.  One node works, but the 
other one complains about an invalid credential for munge.  I've verified that 
the munge.key is the same as on all other nodes with


sudo cksum /etc/munge/munge.key
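To avoid eyeballing checksums node by node, the same check can be scripted. A sketch where the host list is a placeholder and only the all_match helper is load-bearing:

```shell
#!/bin/sh
# Sketch: verify every node reports the same munge.key checksum.
# The hostnames below are placeholders (assumption).

all_match() {
    # Succeed iff every line on stdin is identical.
    [ "$(sort -u | wc -l)" -le 1 ]
}

# Cross-node check (uncomment and adjust the host list):
# for n in controller node01 node02; do
#     ssh "$n" sudo cksum /etc/munge/munge.key
# done | all_match && echo "keys match" || echo "KEY MISMATCH"

# Local demonstration with two identical example checksum lines:
printf '1713386549 1024 munge.key\n1713386549 1024 munge.key\n' \
    | all_match && echo "keys match"
```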

 

I recopied the munge.key from a node that works.  I've verified that the munge
uid and gid are the same on all nodes.  The time is in sync on all nodes.

 

Here is what is in the slurmd.log:

 

 error: Unable to register: Unable to contact slurm controller (connect failure)
 error: Munge decode failed: Invalid credential
 ENCODED: Wed Dec 31 17:00:00 1969
 DECODED: Wed Dec 31 17:00:00 1969
 error: authentication: Invalid authentication credential
 error: slurm_receive_msg_and_forward: Protocol authentication error
 error: service_connection: slurm_receive_msg: Protocol authentication error
 error: Unable to register: Unable to contact slurm controller (connect failure)

 

I've checked the munged.log and all it says is

 

Invalid credential 

 

Thanks for your help

--
Carles Fenoy
