Nousheen,
When a node is not responding, the first thing to check is that the 
node is up and slurmd is running. It looks like you have confirmed that 
with your output from the command "scontrol show slurmd", so that is a 
good start. After verifying that slurmd is running, the next step is to 
examine the logs: look at /var/log/slurmd.log on that node and see 
whether it tells you why it can't communicate with the Slurm controller. 
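A rough sketch of what I mean (run on node 104; the log path comes from 
your "scontrol show slurmd" output, and the grep pattern is just a guess 
at the usual suspects):

```shell
# Run on the non-responding node (104).
# Confirm the daemon is actually running and healthy:
systemctl status slurmd
# Pull the most recent suspicious lines from the slurmd log
# (path taken from your "scontrol show slurmd" output):
grep -E 'error|fatal|refused|Unable' /var/log/slurmd.log | tail -n 20
```

The "Last slurmctld msg time = NONE" line in your output suggests that 
slurmd has never heard back from the controller, so I would expect the 
log to show connection or authentication errors.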
Other things to think about, since this is a new setup, are network 
stability and name resolution: make sure that every node in the cluster 
can correctly resolve the hostnames of all the other nodes.
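As a sketch, something like this run on each node will flag resolution 
problems (the node list is only an assumption based on the names in this 
thread; substitute your own):

```shell
#!/bin/sh
# Verify this host can resolve every node name in the cluster.
# NODES is an assumption taken from the hostnames in this thread.
NODES="nousheen 101 104 105"
for n in $NODES; do
  if getent hosts "$n" >/dev/null 2>&1; then
    echo "$n resolves to $(getent hosts "$n" | awk '{print $1; exit}')"
  else
    echo "$n: FAILED to resolve"
  fi
done
# Note: purely numeric hostnames like "101" can be parsed by the
# resolver as an IP address (0.0.0.101), which may be worth keeping in
# mind given the ENCODE_HOST line in your earlier unmunge output.
```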
 
Mike Robbert
Cyberinfrastructure Specialist, Cyberinfrastructure and Advanced Research 
Computing
Information and Technology Solutions (ITS)
303-273-3786 | mrobb...@mines.edu  

Our values: Trust | Integrity | Respect | Responsibility


 
 
From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Nousheen 
<nousheenparv...@gmail.com>
Date: Friday, December 2, 2022 at 09:22
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] [External] ERROR: slurmctld: auth/munge: 
_print_cred: DECODED

CAUTION: This email originated from outside of the Colorado School of Mines 
organization. Do not click on links or open attachments unless you recognize 
the sender and know the content is safe.

 


Dear Ole,

Thank you so much for your response. I have now adjusted the RealMemory in 
slurm.conf, which was previously set to the default. Your insight was really 
helpful. Now, when I submit the job, it runs on three nodes, but one node 
(104) is not responding. The output of some commands is given below.


[root@nousheen ~]# squeue -j
             JOBID PARTITION     NAME     USER ST       TIME  NODES 
NODELIST(REASON)
               120     debug   SRBD-1 nousheen  R       0:54      1 101
               121     debug   SRBD-2 nousheen  R       0:54      1 105
               122     debug   SRBD-3 nousheen  R       0:54      1 nousheen
               123     debug   SRBD-4 nousheen  R       0:54      2 105,nousheen
  
  
[root@nousheen ~]# scontrol show nodes
NodeName=101 Arch=x86_64 CoresPerSocket=6 
   CPUAlloc=8 CPUTot=12 CPULoad=0.01
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=192.168.60.118 NodeHostName=101 Version=21.08.4
   OS=Linux 3.10.0-1160.59.1.el7.x86_64 #1 SMP Wed Feb 23 16:47:03 UTC 2022 
   RealMemory=31919 AllocMem=0 FreeMem=293 Sockets=1 Boards=1
   State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=debug 
   BootTime=2022-11-24T11:18:28 SlurmdStartTime=2022-12-02T19:56:01
   LastBusyTime=2022-12-02T19:58:14
   CfgTRES=cpu=12,mem=31919M,billing=12
   AllocTRES=cpu=8
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

NodeName=104 Arch=x86_64 CoresPerSocket=6 
   CPUAlloc=0 CPUTot=12 CPULoad=0.01
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=192.168.60.104 NodeHostName=104 Version=21.08.4
   OS=Linux 3.10.0-1160.71.1.el7.x86_64 #1 SMP Tue Jun 28 15:37:28 UTC 2022 
   RealMemory=31889 AllocMem=0 FreeMem=30433 Sockets=1 Boards=1
   State=IDLE+NOT_RESPONDING ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A 
MCS_label=N/A
   Partitions=debug 
   BootTime=2022-11-24T11:15:43 SlurmdStartTime=2022-12-02T19:57:29
   LastBusyTime=2022-12-02T19:58:14
   CfgTRES=cpu=12,mem=31889M,billing=12
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

NodeName=105 Arch=x86_64 CoresPerSocket=6 
   CPUAlloc=12 CPUTot=12 CPULoad=1.03
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=192.168.60.105 NodeHostName=105 Version=21.08.4
   OS=Linux 3.10.0-1160.76.1.el7.x86_64 #1 SMP Wed Aug 10 16:21:17 UTC 2022 
   RealMemory=32051 AllocMem=0 FreeMem=14874 Sockets=1 Boards=1
   State=ALLOCATED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=debug 
   BootTime=2022-11-24T11:15:37 SlurmdStartTime=2022-12-02T19:56:57
   LastBusyTime=2022-12-02T19:58:14
   CfgTRES=cpu=12,mem=32051M,billing=12
   AllocTRES=cpu=12
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

NodeName=nousheen Arch=x86_64 CoresPerSocket=6 
   CPUAlloc=12 CPUTot=12 CPULoad=0.32
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=192.168.60.194 NodeHostName=nousheen Version=21.08.5
   OS=Linux 3.10.0-1160.15.2.el7.x86_64 #1 SMP Wed Feb 3 15:06:38 UTC 2021 
   RealMemory=31889 AllocMem=0 FreeMem=16666 Sockets=1 Boards=1
   State=ALLOCATED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=debug 
   BootTime=2022-12-01T12:00:18 SlurmdStartTime=2022-12-02T19:56:36
   LastBusyTime=2022-12-02T19:58:15
   CfgTRES=cpu=12,mem=31889M,billing=12
   AllocTRES=cpu=12
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s


[root@104 ~]# scontrol show slurmd
Active Steps             = NONE
Actual CPUs              = 12
Actual Boards            = 1
Actual sockets           = 1
Actual cores             = 6
Actual threads per core  = 2
Actual real memory       = 31889 MB
Actual temp disk space   = 106648 MB
Boot time                = 2022-12-02T19:57:29
Hostname                 = 104
Last slurmctld msg time  = NONE
Slurmd PID               = 16906
Slurmd Debug             = 3
Slurmd Logfile           = /var/log/slurmd.log
Version                  = 21.08.4


If you can give me a hint as to what could be the reason behind the one 
node not responding, or which files or problems I should focus on, I would 
be highly grateful. Thank you for your time.

Best regards,

Nousheen 






 


 
On Fri, Dec 2, 2022 at 11:56 AM Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk> 
wrote:


Hi Nousheen,

It seems that you have configured the nodes incorrectly in slurm.conf.  I 
notice this:

   RealMemory=1

This means 1 megabyte of RAM; we only had that with IBM PCs back in 
the 1980s :-)

See how to configure nodes in 
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#compute-node-configuration

You must run "slurmd -C" on each node to determine its actual hardware.
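For example (the output shown is only a sketch of its shape, with made-up 
values, not real numbers from any of your nodes):

```shell
# Run on each compute node; it prints a ready-made node definition
# that you can paste into slurm.conf (drop the UpTime line):
slurmd -C
# Roughly of this shape (values here are illustrative only):
#   NodeName=104 CPUs=12 Boards=1 SocketsPerBoard=1 CoresPerSocket=6 \
#     ThreadsPerCore=2 RealMemory=31889
```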

I hope this helps.

/Ole

On 12/1/22 21:08, Nousheen wrote:
> Dear Robbert,
> 
> Thank you so much for your response. I was so focused on syncing the time 
> that I missed the date on one of the nodes, which was one day behind, as 
> you said. I have corrected it and now I get the following output in status.
> 
> *(base) [nousheen@nousheen slurm]$ systemctl status slurmctld.service -l*
> ● slurmctld.service - Slurm controller daemon
>     Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled; vendor 
> preset: disabled)
>     Active: active (running) since Thu 2022-12-01 21:37:34 PKT; 20min ago
>   Main PID: 19475 (slurmctld)
>      Tasks: 10
>     Memory: 4.5M
>     CGroup: /system.slice/slurmctld.service
>             ├─19475 /usr/sbin/slurmctld -D -s
>             └─19538 slurmctld: slurmscriptd
> 
> Dec 01 21:47:08 nousheen slurmctld[19475]: slurmctld: sched/backfill: 
> _start_job: Started JobId=106 in debug on 101
> Dec 01 21:47:09 nousheen slurmctld[19475]: slurmctld: _job_complete: 
> JobId=106 WEXITSTATUS 1
> Dec 01 21:47:09 nousheen slurmctld[19475]: slurmctld: _job_complete: 
> JobId=106 done
> Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: sched: Allocate 
> JobId=107 NodeList=101 #CPUs=8 Partition=debug
> Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: sched: Allocate 
> JobId=108 NodeList=105 #CPUs=8 Partition=debug
> Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: sched: Allocate 
> JobId=109 NodeList=nousheen #CPUs=8 Partition=debug
> Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: _job_complete: 
> JobId=107 WEXITSTATUS 1
> Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: _job_complete: 
> JobId=107 done
> Dec 01 21:47:12 nousheen slurmctld[19475]: slurmctld: _job_complete: 
> JobId=108 WEXITSTATUS 1
> Dec 01 21:47:12 nousheen slurmctld[19475]: slurmctld: _job_complete: 
> JobId=108 done
> 
> I have a total of four nodes, one of which is the server node. After 
> submitting a job, the job only runs on my server compute node while all 
> the other nodes are IDLE, DOWN, or not responding. The details are given below:
> 
> *(base) [nousheen@nousheen slurm]$ scontrol show nodes*
> NodeName=101 Arch=x86_64 CoresPerSocket=6
>     CPUAlloc=0 CPUTot=12 CPULoad=0.01
>     AvailableFeatures=(null)
>     ActiveFeatures=(null)
>     Gres=(null)
>     NodeAddr=192.168.60.101 NodeHostName=101 Version=21.08.4
>     OS=Linux 3.10.0-1160.59.1.el7.x86_64 #1 SMP Wed Feb 23 16:47:03 UTC 2022
>     RealMemory=1 AllocMem=0 FreeMem=641 Sockets=1 Boards=1
>     State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>     Partitions=debug
>     BootTime=2022-11-24T11:18:28 SlurmdStartTime=2022-12-01T21:34:57
>     LastBusyTime=2022-12-02T00:58:31
>     CfgTRES=cpu=12,mem=1M,billing=12
>     AllocTRES=
>     CapWatts=n/a
>     CurrentWatts=0 AveWatts=0
>     ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
> 
> NodeName=104 CoresPerSocket=6
>     CPUAlloc=0 CPUTot=12 CPULoad=N/A
>     AvailableFeatures=(null)
>     ActiveFeatures=(null)
>     Gres=(null)
>     NodeAddr=192.168.60.114 NodeHostName=104
>     RealMemory=1 AllocMem=0 FreeMem=N/A Sockets=1 Boards=1
>     State=DOWN+NOT_RESPONDING ThreadsPerCore=2 TmpDisk=0 Weight=1 
> Owner=N/A MCS_label=N/A
>     Partitions=debug
>     BootTime=None SlurmdStartTime=None
>     LastBusyTime=2022-12-01T21:37:35
>     CfgTRES=cpu=12,mem=1M,billing=12
>     AllocTRES=
>     CapWatts=n/a
>     CurrentWatts=0 AveWatts=0
>     ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>     Reason=Not responding [slurm@2022-12-01T16:22:28]
> 
> NodeName=105 Arch=x86_64 CoresPerSocket=6
>     CPUAlloc=0 CPUTot=12 CPULoad=1.08
>     AvailableFeatures=(null)
>     ActiveFeatures=(null)
>     Gres=(null)
>     NodeAddr=192.168.60.115 NodeHostName=105 Version=21.08.4
>     OS=Linux 3.10.0-1160.76.1.el7.x86_64 #1 SMP Wed Aug 10 16:21:17 UTC 2022
>     RealMemory=1 AllocMem=0 FreeMem=20723 Sockets=1 Boards=1
>     State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>     Partitions=debug
>     BootTime=2022-11-24T11:15:37 SlurmdStartTime=2022-12-01T16:15:30
>     LastBusyTime=2022-12-01T21:47:11
>     CfgTRES=cpu=12,mem=1M,billing=12
>     AllocTRES=
>     CapWatts=n/a
>     CurrentWatts=0 AveWatts=0
>     ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
> 
> NodeName=nousheen Arch=x86_64 CoresPerSocket=6
>     CPUAlloc=8 CPUTot=12 CPULoad=6.73
>     AvailableFeatures=(null)
>     ActiveFeatures=(null)
>     Gres=(null)
>     NodeAddr=192.168.60.149 NodeHostName=nousheen Version=21.08.5
>     OS=Linux 3.10.0-1160.15.2.el7.x86_64 #1 SMP Wed Feb 3 15:06:38 UTC 2021
>     RealMemory=1 AllocMem=0 FreeMem=22736 Sockets=1 Boards=1
>     State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>     Partitions=debug
>     BootTime=2022-12-01T12:00:08 SlurmdStartTime=2022-12-01T12:00:42
>     LastBusyTime=2022-12-01T21:37:39
>     CfgTRES=cpu=12,mem=1M,billing=12
>     AllocTRES=cpu=8
>     CapWatts=n/a
>     CurrentWatts=0 AveWatts=0
>     ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
> 
> Whereas this command shows only the one node on which a job is running:
> 
> *(base) [nousheen@nousheen slurm]$ squeue -j*
>               JOBID PARTITION     NAME     USER ST       TIME  NODES 
> NODELIST(REASON)
>                 109     debug   SRBD-4 nousheen  R    3:17:48      1 nousheen
> 
> Can you please guide me as to why my compute nodes are down and not working?
> 
> Thank you for your time.
> 
> 
> Best Regards,
> Nousheen Parvaiz
> 
> 
> 
> On Thu, Dec 1, 2022 at 8:55 PM Michael Robbert <mrobb...@mines.edu 
> <mailto:mrobb...@mines.edu>> wrote:
> 
>     I believe that the error you need to pay attention to for this issue
>     is this line:
> 
>     Dec 01 16:17:19 nousheen slurmctld[1631]: slurmctld: error: Check for
>     out of sync clocks
> 
>     It looks like your compute node's clock is a full day ahead of your
>     controller node: Dec. 2 instead of Dec. 1. The clocks need to be in
>     sync for munge to work.
> 
>     *Mike Robbert*
> 
>     *Cyberinfrastructure Specialist, Cyberinfrastructure and Advanced
>     Research Computing*
> 
>     Information and Technology Solutions (ITS)
> 
>     303-273-3786 | mrobb...@mines.edu <mailto:mrobb...@mines.edu>
> 
>     *Our values:* Trust | Integrity | Respect | Responsibility
> 
>     *From:* slurm-users <slurm-users-boun...@lists.schedmd.com
>     <mailto:slurm-users-boun...@lists.schedmd.com>> on behalf of Nousheen
>     <nousheenparv...@gmail.com <mailto:nousheenparv...@gmail.com>>
>     *Date:* Thursday, December 1, 2022 at 06:19
>     *To:* Slurm User Community List <slurm-users@lists.schedmd.com
>     <mailto:slurm-users@lists.schedmd.com>>
>     *Subject:* [External] [slurm-users] ERROR: slurmctld: auth/munge:
>     _print_cred: DECODED
> 
>     Hello Everyone,
> 
>     I am using Slurm version 21.08.5 and CentOS 7.
> 
>     I successfully start slurmd on all compute nodes, but when I start
>     slurmctld on the server node it gives the following error:
> 
>     *(base) [nousheen@nousheen ~]$ systemctl status slurmctld.service -l*
>     ● slurmctld.service - Slurm controller daemon
>         Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled;
>     vendor preset: disabled)
>         Active: active (running) since Thu 2022-12-01 12:00:42 PKT; 4h
>     16min ago
>       Main PID: 1631 (slurmctld)
>          Tasks: 10
>         Memory: 4.0M
>         CGroup: /system.slice/slurmctld.service
>                 ├─1631 /usr/sbin/slurmctld -D -s
>                 └─1818 slurmctld: slurmscriptd
> 
>     Dec 01 16:17:19 nousheen slurmctld[1631]: slurmctld: auth/munge:
>     _print_cred: DECODED: Thu Dec 01 16:17:19 2022
>     Dec 01 16:17:19 nousheen slurmctld[1631]: slurmctld: error: Check for
>     out of sync clocks
>     Dec 01 16:17:20 nousheen slurmctld[1631]: slurmctld: error: Munge
>     decode failed: Rewound credential
>     Dec 01 16:17:20 nousheen slurmctld[1631]: slurmctld: auth/munge:
>     _print_cred: ENCODED: Fri Dec 02 16:16:55 2022
>     Dec 01 16:17:20 nousheen slurmctld[1631]: slurmctld: auth/munge:
>     _print_cred: DECODED: Thu Dec 01 16:17:20 2022
>     Dec 01 16:17:20 nousheen slurmctld[1631]: slurmctld: error: Check for
>     out of sync clocks
>     Dec 01 16:17:21 nousheen slurmctld[1631]: slurmctld: error: Munge
>     decode failed: Rewound credential
>     Dec 01 16:17:21 nousheen slurmctld[1631]: slurmctld: auth/munge:
>     _print_cred: ENCODED: Fri Dec 02 16:16:56 2022
>     Dec 01 16:17:21 nousheen slurmctld[1631]: slurmctld: auth/munge:
>     _print_cred: DECODED: Thu Dec 01 16:17:21 2022
>     Dec 01 16:17:21 nousheen slurmctld[1631]: slurmctld: error: Check for
>     out of sync clocks
> 
>     When I run the following command on the compute nodes I get the
>     following output:
> 
>     [gpu101@101 ~]$ *munge -n | unmunge*
> 
>     STATUS:           Success (0)
>     ENCODE_HOST:      ??? (0.0.0.101)
>     ENCODE_TIME:      2022-12-02 16:33:38 +0500 (1669980818)
>     DECODE_TIME:      2022-12-02 16:33:38 +0500 (1669980818)
>     TTL:              300
>     CIPHER:           aes128 (4)
>     MAC:              sha1 (3)
>     ZIP:              none (0)
>     UID:              gpu101 (1000)
>     GID:              gpu101 (1000)
>     LENGTH:           0
> 
>     Is this error because the ENCODE_HOST name shows question marks and
>     the IP is not picked up correctly by munge? How can I correct this?
>     All the nodes keep not responding when I run a job, even though I
>     have all the clocks synced across the cluster.
> 
>     I am new to Slurm. Kindly guide me in this matter.
