Having the head node run as an NTP server is a good idea. I set up my clusters the same way. Is it possible that ntp.conf on the head node has a restrict statement that restricts access to it by IP address/range, which is why this one node on a different network can't reach it?
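For reference, a restrict setup like the following (subnet values are made-up placeholders, not from your config) would let the original compute subnet sync while silently blocking a node on a new one:

```
# /etc/ntp.conf on the head node (hypothetical excerpt)
restrict default ignore                                  # deny all clients by default
restrict 127.0.0.1                                       # allow localhost
restrict 192.168.1.0 mask 255.255.255.0 nomodify notrap  # original compute subnet
# A node on, say, 192.168.2.0/24 would need its own restrict line.
```

With `restrict default ignore`, ntpd doesn't even answer queries from unlisted addresses, which would produce exactly the "no server suitable" symptom.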

It sounds like it's working now, but I don't understand why ntpdate would give you that error unless it couldn't reach ntpd on the head node.

Prentice


On 10/27/20 4:58 PM, Gard Nelson wrote:

Thanks for your help, Prentice.

Sorry, yes – CentOS 7.5 installed on a fresh HDD. I rebooted and checked that chronyd is disabled; ntpd is running. The rest of the cluster uses CentOS 7.5 and ntp, so it's possible, although maybe not ideal.

I’m running ntpq on the new compute node. It is looking to the slurm head node which is also set up as the ntp server. Here’s the output:

[root ~]# ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 HEADNODE_IP     .XFAC.          16 u    - 1024    0    0.000    0.000   0.000

It was a bit of a pain to get set up. The time difference was several hours, so ntp would have taken ages to correct it on its own. I have used ntpdate successfully on the existing compute nodes, but got a "no server suitable for synchronization found" error here, and 'ntpd -gqx' timed out. To set the time, I had to point ntp at the default CentOS pool of ntp servers, then point it back to the head node. After that, 'ntpd -gqx' ran smoothly, and I assume (based on the ntpq output) that it worked. Running 'date' on the new compute node and the existing head node simultaneously returns the same time to within ~1 sec, rather than the 7:30 gap from the log file.
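For anyone who hits the same thing, the workaround sequence looks roughly like this (run as root on the new node; the pool hostname is the stock CentOS default, and the head-node address is a placeholder):

```
systemctl stop ntpd               # ntpd must not hold the port during a one-shot sync
ntpdate 0.centos.pool.ntp.org     # step the clock immediately from the public pool
# then edit /etc/ntp.conf back to:  server HEADNODE_IP iburst
ntpd -gqx                         # -g: ignore the panic threshold; -q: set once and exit
systemctl start ntpd
```

This is a sketch of the steps described above, not a tested recipe; it assumes the node has outbound access to the public pool.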

Not sure if it’s relevant to this problem, but the new compute node is on a different subnet, connected to a different switch port than the existing compute nodes. This is the first time I’ve set up a node on a different subnet. I figured it would be simple to point Slurm to the new node, but I didn’t anticipate the ntp and munge issues.

Thanks,

Gard

*From: *slurm-users <[email protected]> on behalf of Prentice Bisbal <[email protected]>
*Reply-To: *Slurm User Community List <[email protected]>
*Date: *Tuesday, October 27, 2020 at 12:22 PM
*To: *"[email protected]" <[email protected]>
*Subject: *Re: [slurm-users] [External] Munge thinks clocks aren't synced

You don't specify what OS or version you're using. If you're using RHEL 7 or a derivative, chrony is used by default over ntpd, so there could be some confusion between chronyd and ntpd. If you haven't done so already, I'd check to see which daemon is actually running on your system.
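On EL7, a couple of standard systemctl calls make that check unambiguous:

```
systemctl is-active chronyd ntpd     # which daemon is running right now
systemctl is-enabled chronyd ntpd    # which one is set to start at boot
```

If both report active, they will fight over the clock and one should be disabled.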

Can you share the complete output of ntpq -p with us, and let us know what nodes the output is from? You might want to run 'ntpdate' before starting ntpd. If the clocks are too far off, either ntpd won't correct the time, or it will take a long time. ntpdate immediately syncs up the time between servers.

I would make sure ntpdate is installed and enabled as a service, then reboot both nodes. That way ntpdate runs at startup before ntpd, so both nodes boot with the correct time.
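Incidentally, the "Rewound credential" lines in the quoted log quantify the skew directly: munge embeds the encode time in each credential, and slurmctld rejects a credential that decodes earlier than it was encoded. A quick sanity check on the two timestamps from the log:

```python
from datetime import datetime

FMT = "%a %b %d %H:%M:%S %Y"
encoded = datetime.strptime("Tue Oct 27 11:09:45 2020", FMT)  # ENCODED line
decoded = datetime.strptime("Tue Oct 27 11:02:07 2020", FMT)  # DECODED line

# Positive skew means the sending node's clock ran ahead of slurmctld's.
skew = encoded - decoded
print(skew)  # 0:07:38
```

A ~7.5-minute offset is far beyond munge's tolerance, so this really is a clock problem rather than a key mismatch.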

--
Prentice

On 10/27/20 2:08 PM, Gard Nelson wrote:

    Hi everyone,

    I’m adding a new node to an existing cluster. After installing
    slurm and the prereqs, I synced the clocks with ntpd. When I run
    ‘ntpq -p’, I get 0.0 for delay, offset, and jitter. (The slurm head
    node is also the ntp server.) ‘date’ also gives me identical times
    for the head and compute nodes. However, when I start slurmd, I
    get a munge error about the clocks being out of sync. From the
    slurmctld log:

    [2020-10-27T11:02:06.511] node NEW_NODE returned to service

    [2020-10-27T11:02:07.265] error: Munge decode failed: Rewound
    credential

    [2020-10-27T11:02:07.265] ENCODED: Tue Oct 27 11:09:45 2020

    [2020-10-27T11:02:07.265] DECODED: Tue Oct 27 11:02:07 2020

    [2020-10-27T11:02:07.265] error: Check for out of sync clocks

    [2020-10-27T11:02:07.265] error: slurm_unpack_received_msg:
    MESSAGE_NODE_REGISTRATION_STATUS has authentication error: Rewound
    credential

    [2020-10-27T11:02:07.265] error: slurm_unpack_received_msg:
    Protocol authentication error

    [2020-10-27T11:02:07.275] error: slurm_receive_msg
    [HEAD_NODE_IP:PORT]: Unspecified error

    I restarted ntp, munge and the slurm daemons on both nodes before
    this last error was generated. Any idea what’s going on here?

    Thanks,

    Gard



--
Prentice Bisbal
Lead Software Engineer
Research Computing
Princeton Plasma Physics Laboratory
http://www.pppl.gov  

