I believe the error you need to pay attention to for this issue is this line:

Dec 01 16:17:19 nousheen slurmctld[1631]: slurmctld: error: Check for out of sync clocks

It looks like your compute nodes' clocks are a full day ahead of your controller node's clock: Dec. 2 instead of Dec. 1. The clocks need to be in sync for munge to work.

Mike Robbert
Cyberinfrastructure Specialist, Cyberinfrastructure and Advanced Research Computing
Information and Technology Solutions (ITS)
303-273-3786 | mrobb...@mines.edu
Our values: Trust | Integrity | Respect | Responsibility

From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Nousheen <nousheenparv...@gmail.com>
Date: Thursday, December 1, 2022 at 06:19
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: [External] [slurm-users] ERROR: slurmctld: auth/munge: _print_cred: DECODED

Hello Everyone,

I am using Slurm version 21.08.5 on CentOS 7. slurmd starts successfully on all compute nodes, but when I start slurmctld on the server node it gives the following error:

(base) [nousheen@nousheen ~]$ systemctl status slurmctld.service -l
● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2022-12-01 12:00:42 PKT; 4h 16min ago
 Main PID: 1631 (slurmctld)
    Tasks: 10
   Memory: 4.0M
   CGroup: /system.slice/slurmctld.service
           ├─1631 /usr/sbin/slurmctld -D -s
           └─1818 slurmctld: slurmscriptd

Dec 01 16:17:19 nousheen slurmctld[1631]: slurmctld: auth/munge: _print_cred: DECODED: Thu Dec 01 16:17:19 2022
Dec 01 16:17:19 nousheen slurmctld[1631]: slurmctld: error: Check for out of sync clocks
Dec 01 16:17:20 nousheen slurmctld[1631]: slurmctld: error: Munge decode failed: Rewound credential
Dec 01 16:17:20 nousheen slurmctld[1631]: slurmctld: auth/munge: _print_cred: ENCODED: Fri Dec 02 16:16:55 2022
Dec 01 16:17:20 nousheen slurmctld[1631]: slurmctld: auth/munge: _print_cred: DECODED: Thu Dec 01 16:17:20 2022
Dec 01 16:17:20 nousheen slurmctld[1631]: slurmctld: error: Check for out of sync clocks
Dec 01 16:17:21 nousheen slurmctld[1631]: slurmctld: error: Munge decode failed: Rewound credential
Dec 01 16:17:21 nousheen slurmctld[1631]: slurmctld: auth/munge: _print_cred: ENCODED: Fri Dec 02 16:16:56 2022
Dec 01 16:17:21 nousheen slurmctld[1631]: slurmctld: auth/munge: _print_cred: DECODED: Thu Dec 01 16:17:21 2022
Dec 01 16:17:21 nousheen slurmctld[1631]: slurmctld: error: Check for out of sync clocks

When I run the following command on the compute nodes, I get this output:

[gpu101@101 ~]$ munge -n | unmunge
STATUS:           Success (0)
ENCODE_HOST:      ??? (0.0.0.101)
ENCODE_TIME:      2022-12-02 16:33:38 +0500 (1669980818)
DECODE_TIME:      2022-12-02 16:33:38 +0500 (1669980818)
TTL:              300
CIPHER:           aes128 (4)
MAC:              sha1 (3)
ZIP:              none (0)
UID:              gpu101 (1000)
GID:              gpu101 (1000)
LENGTH:           0

Is this error caused by the ENCODE_HOST name showing question marks and munge not picking up the IP address correctly? How can I correct this? All the nodes keep showing as not responding when I run a job, even though I have all the clocks synced across the cluster.

I am new to Slurm. Kindly guide me in this matter.

Best Regards,
Nousheen Parvaiz
Ph.D. Scholar
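To make the clock offset in the quoted log concrete, here is a minimal Python sketch that subtracts one DECODED timestamp from its matching ENCODED timestamp. The function name `credential_skew_seconds` is hypothetical, and comparing the result against the TTL of 300 reported by unmunge is only a rough illustration; it is not a reproduction of munge's actual validation logic. The parsing also assumes an English (C) locale for the day and month names.

```python
from datetime import datetime

# Timestamp layout of the _print_cred lines in the slurmctld log.
LOG_FORMAT = "%a %b %d %H:%M:%S %Y"

# TTL value shown in the unmunge output (seconds).
TTL_SECONDS = 300


def credential_skew_seconds(encoded: str, decoded: str) -> float:
    """Return encode time minus decode time, in seconds (hypothetical helper)."""
    enc = datetime.strptime(encoded, LOG_FORMAT)
    dec = datetime.strptime(decoded, LOG_FORMAT)
    return (enc - dec).total_seconds()


# One ENCODED/DECODED pair copied from the log at 16:17:20.
skew = credential_skew_seconds(
    "Fri Dec 02 16:16:55 2022",  # time stamped by the compute node
    "Thu Dec 01 16:17:20 2022",  # time seen by the controller
)

print(f"skew: {skew:.0f} s")
# Rough comparison only: a positive skew far above the TTL means the
# credential appears to come from the future.
print("within TTL" if abs(skew) <= TTL_SECONDS else "outside TTL")
```

For this pair the difference is 86375 seconds, about 24 hours, which is why the controller logs "Rewound credential" even though `munge -n | unmunge` succeeds when run locally on a compute node: encoding and decoding on the same machine use the same clock.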