I see two things you should probably do:
1. Run ntpd on your nodes. You can even have them sync with your master.
2. Sync your user data across the nodes too, even if that just means
ensuring /etc/passwd and /etc/group are identical on all of them.
While ntp is not required for slurm, the time sync is very important and
ntp makes that a non-issue. Best practices and all.
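As a rough way to confirm that the clocks really agree, something like the following could be run from the controller. It is only a sketch: `clock_offset` is a made-up helper name, the node names are placeholders, and it assumes passwordless ssh to each node. The result includes ssh latency, so treat a second or two of offset as noise.

```shell
# Hypothetical helper (not part of slurm or munge): print a node name and
# the difference in whole seconds between its clock and the local clock.
# Assumes passwordless ssh to the node; ssh latency is included in the result.
clock_offset() {
    node="$1"
    remote=$(ssh "$node" date +%s) || return 1   # epoch seconds on the node
    local_now=$(date +%s)                        # epoch seconds here
    echo "$node $((remote - local_now))"
}

# Example use (node names are placeholders):
#   for n in node01 node02; do clock_offset "$n"; done
```

An offset of more than a few seconds on one node would be worth chasing before looking anywhere else.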
#2 is something that is often overlooked, but obvious when you think
about it.
I have seen folks add users by running 'useradd' on each node, but that
leaves the uids out of step if a package install (or anything else) has
already taken the next free uid on any node.
The error below looks like you may have a different uid for the slurm
user on the node. What uid is slurmd running as on the bad node vs a
good node?
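One way to compare the ids is to pull the passwd entry for the slurm and munge accounts on each node and look at them side by side. This is a minimal sketch: `passwd_ids` is a made-up helper name, the node names are placeholders, and it assumes `getent` is available and that you have passwordless ssh to the nodes.

```shell
# Hypothetical helper: read passwd-format lines on stdin and print
# "name uid gid" for the account named in $1.
passwd_ids() {
    awk -F: -v u="$1" '$1 == u { print $1, $3, $4 }'
}

# Example use (node names are placeholders):
#   getent passwd | passwd_ids slurm                 # this node
#   ssh node01 getent passwd | passwd_ids slurm      # a remote node
#   ssh node01 getent passwd | passwd_ids munge
```

If the numbers printed for the bad node differ from those on a good node, that mismatch is the first thing to fix.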
Brian Andrus
On 4/17/2020 2:38 PM, Dean Schulze wrote:
Just noticed this. On the problem node the munged.log file has an
entry every 1:40:
2020-04-17 15:31:02 -0600 Info: Invalid credential
2020-04-17 15:32:42 -0600 Info: Invalid credential
2020-04-17 15:34:22 -0600 Info: Invalid credential
This happens on the failed node and two other nodes that work. Two
nodes that work (including the controller) don't have this message.
On Fri, Apr 17, 2020 at 2:00 PM Riebs, Andy <andy.ri...@hpe.com> wrote:
A couple of quick checks to see if the problem is munge:
1. On the problem node, try
$ echo foo | munge | unmunge
2. If (1) works, try this from the node running slurmctld to the
problem node
slurm-node$ echo foo | ssh node munge | unmunge
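To repeat the second check against several nodes at once, the same pipeline can be wrapped in a small function. This is only a sketch: `check_munge_from` is a made-up name, the node names are placeholders, and it assumes passwordless ssh from the slurmctld host. It relies on `unmunge` exiting non-zero when the credential cannot be decoded.

```shell
# Hypothetical helper: encode a credential on a remote node and decode it
# locally, printing OK or FAILED for that node. The pipeline's exit status
# is that of unmunge, the last command in it.
check_munge_from() {
    node="$1"
    if echo foo | ssh "$node" munge | unmunge >/dev/null 2>&1; then
        echo "$node: OK"
    else
        echo "$node: FAILED"
    fi
}

# Example use from the slurmctld host (node names are placeholders):
#   for n in node01 node02 node03; do check_munge_from "$n"; done
```

Running it for a good node and the problem node side by side shows whether the failure is specific to credentials created on the new node.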
From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Dean Schulze
Sent: Friday, April 17, 2020 3:40 PM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] Munge decode failing on new node
There is no ntp service running on any of my nodes, and all but
this one are working. I haven't heard that ntp is a requirement
for slurm, just that the time must be synchronized across the
cluster. And it is.
On Wed, Apr 15, 2020 at 12:17 PM Carlos Fenoy <mini...@gmail.com> wrote:
I’d check ntp as your encoding time seems odd to me
On Wed, 15 Apr 2020 at 19:59, Dean Schulze <dean.w.schu...@gmail.com> wrote:
I've installed two new nodes onto my slurm cluster. One
node works, but the other one complains about an invalid
credential for munge. I've verified that the munge.key is
the same as on all other nodes with
sudo cksum /etc/munge/munge.key
I recopied a munge.key from a node that works. I've
verified that munge uid and gid are the same on the
nodes. The time is in sync on all nodes.
Here is what is in the slurmd.log:
error: Unable to register: Unable to contact slurm controller (connect failure)
error: Munge decode failed: Invalid credential
ENCODED: Wed Dec 31 17:00:00 1969
DECODED: Wed Dec 31 17:00:00 1969
error: authentication: Invalid authentication credential
error: slurm_receive_msg_and_forward: Protocol authentication error
error: service_connection: slurm_receive_msg: Protocol authentication error
error: Unable to register: Unable to contact slurm controller (connect failure)
I've checked in the munged.log and all it says is
Invalid credential
Thanks for your help
--
Carles Fenoy