Dear Ole,

Thank you so much for your response. I have now adjusted RealMemory in slurm.conf, which had previously been left at its default value. Your insight was really helpful. Now, when I submit jobs, they run on three nodes, but one node (104) is not responding. The output of the relevant commands is given below.
*[root@nousheen ~]# squeue -j*
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    120     debug   SRBD-1 nousheen  R       0:54      1 101
    121     debug   SRBD-2 nousheen  R       0:54      1 105
    122     debug   SRBD-3 nousheen  R       0:54      1 nousheen
    123     debug   SRBD-4 nousheen  R       0:54      2 105,nousheen

*[root@nousheen ~]# scontrol show nodes*
NodeName=101 Arch=x86_64 CoresPerSocket=6
   CPUAlloc=8 CPUTot=12 CPULoad=0.01
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=192.168.60.118 NodeHostName=101 Version=21.08.4
   OS=Linux 3.10.0-1160.59.1.el7.x86_64 #1 SMP Wed Feb 23 16:47:03 UTC 2022
   RealMemory=31919 AllocMem=0 FreeMem=293 Sockets=1 Boards=1
   State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=debug
   BootTime=2022-11-24T11:18:28 SlurmdStartTime=2022-12-02T19:56:01
   LastBusyTime=2022-12-02T19:58:14
   CfgTRES=cpu=12,mem=31919M,billing=12
   AllocTRES=cpu=8
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

NodeName=104 Arch=x86_64 CoresPerSocket=6
   CPUAlloc=0 CPUTot=12 CPULoad=0.01
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=192.168.60.104 NodeHostName=104 Version=21.08.4
   OS=Linux 3.10.0-1160.71.1.el7.x86_64 #1 SMP Tue Jun 28 15:37:28 UTC 2022
   RealMemory=31889 AllocMem=0 FreeMem=30433 Sockets=1 Boards=1
   State=IDLE+NOT_RESPONDING ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=debug
   BootTime=2022-11-24T11:15:43 SlurmdStartTime=2022-12-02T19:57:29
   LastBusyTime=2022-12-02T19:58:14
   CfgTRES=cpu=12,mem=31889M,billing=12
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

NodeName=105 Arch=x86_64 CoresPerSocket=6
   CPUAlloc=12 CPUTot=12 CPULoad=1.03
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=192.168.60.105 NodeHostName=105 Version=21.08.4
   OS=Linux 3.10.0-1160.76.1.el7.x86_64 #1 SMP Wed Aug 10 16:21:17 UTC 2022
   RealMemory=32051 AllocMem=0 FreeMem=14874 Sockets=1 Boards=1
   State=ALLOCATED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=debug
   BootTime=2022-11-24T11:15:37 SlurmdStartTime=2022-12-02T19:56:57
   LastBusyTime=2022-12-02T19:58:14
   CfgTRES=cpu=12,mem=32051M,billing=12
   AllocTRES=cpu=12
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

NodeName=nousheen Arch=x86_64 CoresPerSocket=6
   CPUAlloc=12 CPUTot=12 CPULoad=0.32
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=192.168.60.194 NodeHostName=nousheen Version=21.08.5
   OS=Linux 3.10.0-1160.15.2.el7.x86_64 #1 SMP Wed Feb 3 15:06:38 UTC 2021
   RealMemory=31889 AllocMem=0 FreeMem=16666 Sockets=1 Boards=1
   State=ALLOCATED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=debug
   BootTime=2022-12-01T12:00:18 SlurmdStartTime=2022-12-02T19:56:36
   LastBusyTime=2022-12-02T19:58:15
   CfgTRES=cpu=12,mem=31889M,billing=12
   AllocTRES=cpu=12
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

*[root@104 ~]# scontrol show slurmd*
Active Steps             = NONE
Actual CPUs              = 12
Actual Boards            = 1
Actual sockets           = 1
Actual cores             = 6
Actual threads per core  = 2
Actual real memory       = 31889 MB
Actual temp disk space   = 106648 MB
Boot time                = 2022-12-02T19:57:29
Hostname                 = 104
Last slurmctld msg time  = NONE
Slurmd PID               = 16906
Slurmd Debug             = 3
Slurmd Logfile           = /var/log/slurmd.log
Version                  = 21.08.4

If you can give me a hint as to what might cause one node to stop responding, or which files or settings I should look at, I would be highly grateful.
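One detail in the slurmd output above may be relevant: "Last slurmctld msg time = NONE" means slurmd on 104 has never been contacted by the controller, which usually points to a network or firewall problem rather than a slurm.conf error. A minimal connectivity check, assuming the default ports (SlurmctldPort=6817, SlurmdPort=6818) and the addresses from the scontrol output above, would be:

  # On the controller (nousheen): can we reach slurmd on node 104?
  timeout 3 bash -c 'echo > /dev/tcp/192.168.60.104/6818' && echo "6818 open"

  # On node 104: can slurmd reach slurmctld on the controller?
  timeout 3 bash -c 'echo > /dev/tcp/192.168.60.194/6817' && echo "6817 open"

  # If either check fails on CentOS 7, inspect the firewall on the
  # receiving host:
  firewall-cmd --list-all

If both ports are reachable, watching /var/log/slurmd.log on 104 while restarting slurmd should show why its registration with the controller fails.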
Thank you for your time.

Best regards,
Nousheen

On Fri, Dec 2, 2022 at 11:56 AM Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk> wrote:
> Hi Nousheen,
>
> It seems that you have configured the nodes incorrectly in slurm.conf. I notice this:
>
> RealMemory=1
>
> This means 1 megabyte of RAM memory; we only had that with IBM PCs back in the 1980s :-)
>
> See how to configure nodes in
> https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#compute-node-configuration
>
> You must run "slurmd -C" on each node to determine its actual hardware.
>
> I hope this helps.
>
> /Ole
>
> On 12/1/22 21:08, Nousheen wrote:
> > Dear Robbert,
> >
> > Thank you so much for your response. I was so focused on syncing the time that I missed the date on one of the nodes, which was one day behind, as you said. I have corrected it, and now I get the following output in the status:
> >
> > *(base) [nousheen@nousheen slurm]$ systemctl status slurmctld.service -l*
> > ● slurmctld.service - Slurm controller daemon
> >    Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
> >    Active: active (running) since Thu 2022-12-01 21:37:34 PKT; 20min ago
> >  Main PID: 19475 (slurmctld)
> >     Tasks: 10
> >    Memory: 4.5M
> >    CGroup: /system.slice/slurmctld.service
> >            ├─19475 /usr/sbin/slurmctld -D -s
> >            └─19538 slurmctld: slurmscriptd
> >
> > Dec 01 21:47:08 nousheen slurmctld[19475]: slurmctld: sched/backfill: _start_job: Started JobId=106 in debug on 101
> > Dec 01 21:47:09 nousheen slurmctld[19475]: slurmctld: _job_complete: JobId=106 WEXITSTATUS 1
> > Dec 01 21:47:09 nousheen slurmctld[19475]: slurmctld: _job_complete: JobId=106 done
> > Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: sched: Allocate JobId=107 NodeList=101 #CPUs=8 Partition=debug
> > Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: sched: Allocate JobId=108 NodeList=105 #CPUs=8 Partition=debug
> > Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: sched: Allocate JobId=109 NodeList=nousheen #CPUs=8 Partition=debug
> > Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: _job_complete: JobId=107 WEXITSTATUS 1
> > Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: _job_complete: JobId=107 done
> > Dec 01 21:47:12 nousheen slurmctld[19475]: slurmctld: _job_complete: JobId=108 WEXITSTATUS 1
> > Dec 01 21:47:12 nousheen slurmctld[19475]: slurmctld: _job_complete: JobId=108 done
> >
> > I have four nodes in total, one of which is the server node. After submitting a job, the job only runs on my server compute node while all the other nodes are IDLE, DOWN, or not responding.
> > The details are given below:
> >
> > *(base) [nousheen@nousheen slurm]$ scontrol show nodes*
> > NodeName=101 Arch=x86_64 CoresPerSocket=6
> >    CPUAlloc=0 CPUTot=12 CPULoad=0.01
> >    AvailableFeatures=(null)
> >    ActiveFeatures=(null)
> >    Gres=(null)
> >    NodeAddr=192.168.60.101 NodeHostName=101 Version=21.08.4
> >    OS=Linux 3.10.0-1160.59.1.el7.x86_64 #1 SMP Wed Feb 23 16:47:03 UTC 2022
> >    RealMemory=1 AllocMem=0 FreeMem=641 Sockets=1 Boards=1
> >    State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
> >    Partitions=debug
> >    BootTime=2022-11-24T11:18:28 SlurmdStartTime=2022-12-01T21:34:57
> >    LastBusyTime=2022-12-02T00:58:31
> >    CfgTRES=cpu=12,mem=1M,billing=12
> >    AllocTRES=
> >    CapWatts=n/a
> >    CurrentWatts=0 AveWatts=0
> >    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
> >
> > NodeName=104 CoresPerSocket=6
> >    CPUAlloc=0 CPUTot=12 CPULoad=N/A
> >    AvailableFeatures=(null)
> >    ActiveFeatures=(null)
> >    Gres=(null)
> >    NodeAddr=192.168.60.114 NodeHostName=104
> >    RealMemory=1 AllocMem=0 FreeMem=N/A Sockets=1 Boards=1
> >    State=DOWN+NOT_RESPONDING ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
> >    Partitions=debug
> >    BootTime=None SlurmdStartTime=None
> >    LastBusyTime=2022-12-01T21:37:35
> >    CfgTRES=cpu=12,mem=1M,billing=12
> >    AllocTRES=
> >    CapWatts=n/a
> >    CurrentWatts=0 AveWatts=0
> >    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
> >    Reason=Not responding [slurm@2022-12-01T16:22:28]
> >
> > NodeName=105 Arch=x86_64 CoresPerSocket=6
> >    CPUAlloc=0 CPUTot=12 CPULoad=1.08
> >    AvailableFeatures=(null)
> >    ActiveFeatures=(null)
> >    Gres=(null)
> >    NodeAddr=192.168.60.115 NodeHostName=105 Version=21.08.4
> >    OS=Linux 3.10.0-1160.76.1.el7.x86_64 #1 SMP Wed Aug 10 16:21:17 UTC 2022
> >    RealMemory=1 AllocMem=0 FreeMem=20723 Sockets=1 Boards=1
> >    State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
> >    Partitions=debug
> >    BootTime=2022-11-24T11:15:37 SlurmdStartTime=2022-12-01T16:15:30
> >    LastBusyTime=2022-12-01T21:47:11
> >    CfgTRES=cpu=12,mem=1M,billing=12
> >    AllocTRES=
> >    CapWatts=n/a
> >    CurrentWatts=0 AveWatts=0
> >    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
> >
> > NodeName=nousheen Arch=x86_64 CoresPerSocket=6
> >    CPUAlloc=8 CPUTot=12 CPULoad=6.73
> >    AvailableFeatures=(null)
> >    ActiveFeatures=(null)
> >    Gres=(null)
> >    NodeAddr=192.168.60.149 NodeHostName=nousheen Version=21.08.5
> >    OS=Linux 3.10.0-1160.15.2.el7.x86_64 #1 SMP Wed Feb 3 15:06:38 UTC 2021
> >    RealMemory=1 AllocMem=0 FreeMem=22736 Sockets=1 Boards=1
> >    State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
> >    Partitions=debug
> >    BootTime=2022-12-01T12:00:08 SlurmdStartTime=2022-12-01T12:00:42
> >    LastBusyTime=2022-12-01T21:37:39
> >    CfgTRES=cpu=12,mem=1M,billing=12
> >    AllocTRES=cpu=8
> >    CapWatts=n/a
> >    CurrentWatts=0 AveWatts=0
> >    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
> >
> > Whereas this command shows only the one node on which the job is running:
> >
> > *(base) [nousheen@nousheen slurm]$ squeue -j*
> >   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
> >     109     debug   SRBD-4 nousheen  R    3:17:48      1 nousheen
> >
> > Can you please guide me as to why my compute nodes are down and not working?
> >
> > Thank you for your time.
> >
> > Best Regards,
> > Nousheen Parvaiz
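Following Ole's advice above, the node definitions in slurm.conf should mirror what "slurmd -C" prints on each node. Going by the hardware shown in the scontrol output in this thread (12 CPUs, 1 socket, 6 cores per socket, 2 threads per core, roughly 32 GB of RAM), a corrected definition would look like this sketch; the exact RealMemory value must come from running "slurmd -C" on each node rather than from this example:

  # slurm.conf node definition (sketch; confirm values with "slurmd -C")
  NodeName=101 CPUs=12 Sockets=1 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=31919 State=UNKNOWN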
> > On Thu, Dec 1, 2022 at 8:55 PM Michael Robbert <mrobb...@mines.edu> wrote:
> >
> > I believe that the error you need to pay attention to for this issue is this line:
> >
> > Dec 01 16:17:19 nousheen slurmctld[1631]: slurmctld: error: Check for out of sync clocks
> >
> > It looks like your compute node's clock is a full day ahead of your controller node: Dec. 2 instead of Dec. 1. The clocks need to be in sync for munge to work.
> >
> > Mike Robbert
> > Cyberinfrastructure Specialist, Cyberinfrastructure and Advanced Research Computing
> > Information and Technology Solutions (ITS)
> > 303-273-3786 | mrobb...@mines.edu
> >
> > Our values: Trust | Integrity | Respect | Responsibility
> >
> > From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Nousheen <nousheenparv...@gmail.com>
> > Date: Thursday, December 1, 2022 at 06:19
> > To: Slurm User Community List <slurm-users@lists.schedmd.com>
> > Subject: [External] [slurm-users] ERROR: slurmctld: auth/munge: _print_cred: DECODED
> >
> > Hello Everyone,
> >
> > I am using Slurm version 21.08.5 and CentOS 7.
> >
> > I successfully start slurmd on all compute nodes, but when I start slurmctld on the server node it gives the following error:
> >
> > *(base) [nousheen@nousheen ~]$ systemctl status slurmctld.service -l*
> > ● slurmctld.service - Slurm controller daemon
> >    Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
> >    Active: active (running) since Thu 2022-12-01 12:00:42 PKT; 4h 16min ago
> >  Main PID: 1631 (slurmctld)
> >     Tasks: 10
> >    Memory: 4.0M
> >    CGroup: /system.slice/slurmctld.service
> >            ├─1631 /usr/sbin/slurmctld -D -s
> >            └─1818 slurmctld: slurmscriptd
> >
> > Dec 01 16:17:19 nousheen slurmctld[1631]: slurmctld: auth/munge: _print_cred: DECODED: Thu Dec 01 16:17:19 2022
> > Dec 01 16:17:19 nousheen slurmctld[1631]: slurmctld: error: Check for out of sync clocks
> > Dec 01 16:17:20 nousheen slurmctld[1631]: slurmctld: error: Munge decode failed: Rewound credential
> > Dec 01 16:17:20 nousheen slurmctld[1631]: slurmctld: auth/munge: _print_cred: ENCODED: Fri Dec 02 16:16:55 2022
> > Dec 01 16:17:20 nousheen slurmctld[1631]: slurmctld: auth/munge: _print_cred: DECODED: Thu Dec 01 16:17:20 2022
> > Dec 01 16:17:20 nousheen slurmctld[1631]: slurmctld: error: Check for out of sync clocks
> > Dec 01 16:17:21 nousheen slurmctld[1631]: slurmctld: error: Munge decode failed: Rewound credential
> > Dec 01 16:17:21 nousheen slurmctld[1631]: slurmctld: auth/munge: _print_cred: ENCODED: Fri Dec 02 16:16:56 2022
> > Dec 01 16:17:21 nousheen slurmctld[1631]: slurmctld: auth/munge: _print_cred: DECODED: Thu Dec 01 16:17:21 2022
> > Dec 01 16:17:21 nousheen slurmctld[1631]: slurmctld: error: Check for out of sync clocks
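The "Rewound credential" errors in the log above match Mike's diagnosis exactly: the credentials carry an ENCODED timestamp of Fri Dec 02 but are DECODED on Thu Dec 01. A quick way to confirm that all clocks agree, assuming SSH access to the nodes and chrony (the CentOS 7 default) as the time service, is a sketch like:

  # Compare every compute node's clock against the controller's
  for h in 101 104 105; do ssh $h date; done; date

  # On each node, check that chronyd is actually synchronising
  chronyc tracking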
> > When I run the following command on the compute nodes, I get the following output:
> >
> > [gpu101@101 ~]$ *munge -n | unmunge*
> >
> > STATUS:          Success (0)
> > ENCODE_HOST:     ??? (0.0.0.101)
> > ENCODE_TIME:     2022-12-02 16:33:38 +0500 (1669980818)
> > DECODE_TIME:     2022-12-02 16:33:38 +0500 (1669980818)
> > TTL:             300
> > CIPHER:          aes128 (4)
> > MAC:             sha1 (3)
> > ZIP:             none (0)
> > UID:             gpu101 (1000)
> > GID:             gpu101 (1000)
> > LENGTH:          0
> >
> > Is this error occurring because the ENCODE_HOST name shows question marks and the IP is not being picked up correctly by munge? How can I correct this? All the nodes keep not responding when I run a job, even though I have all the clocks synced across the cluster.
> >
> > I am new to Slurm. Kindly guide me in this matter.
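A hedged observation on the "ENCODE_HOST: ??? (0.0.0.101)" line above: a purely numeric hostname such as "101" can be interpreted by the resolver as an integer IPv4 address (the number 101 is 0.0.0.101), which would explain both the question marks and the bogus address. A quick check on each node:

  # How does the local hostname resolve? munge stamps credentials with the
  # encode host, so "???" suggests the lookup is failing or misleading.
  hostname
  getent hosts "$(hostname)"

  # Note: a purely numeric name like "101" may be parsed as an integer
  # address before /etc/hosts is even consulted. If that turns out to be
  # the case here, giving the nodes non-numeric hostnames (e.g. "node101",
  # a hypothetical name) may be necessary.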