Hi Nousheen,
It seems that you have configured the nodes incorrectly in slurm.conf. I
notice this:
RealMemory=1
This means 1 Megabyte of RAM; we only had that little memory with IBM
PCs back in the 1980s :-)
See how to configure nodes in
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#compute-node-configuration
You must run "slurmd -C" on each node to determine its actual hardware.
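For example, "slurmd -C" prints a NodeName line that you can paste
straight into slurm.conf; the values below are illustrative, not taken
from your nodes:

  [root@101 ~]# slurmd -C
  NodeName=101 CPUs=12 Boards=1 SocketsPerBoard=1 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=64216
  UpTime=7-10:11:12

Set RealMemory for each node to the reported value, or slightly lower
to leave some headroom for the OS.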
I hope this helps.
/Ole
On 12/1/22 21:08, Nousheen wrote:
Dear Robbert,
Thank you so much for your response. I was so focused on the time sync
that I missed the date on one of the nodes, which was one day behind, as
you said. I have corrected it, and now I get the following status output:
*(base) [nousheen@nousheen slurm]$ systemctl status slurmctld.service -l*
● slurmctld.service - Slurm controller daemon
Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled; vendor
preset: disabled)
Active: active (running) since Thu 2022-12-01 21:37:34 PKT; 20min ago
Main PID: 19475 (slurmctld)
Tasks: 10
Memory: 4.5M
CGroup: /system.slice/slurmctld.service
├─19475 /usr/sbin/slurmctld -D -s
└─19538 slurmctld: slurmscriptd
Dec 01 21:47:08 nousheen slurmctld[19475]: slurmctld: sched/backfill:
_start_job: Started JobId=106 in debug on 101
Dec 01 21:47:09 nousheen slurmctld[19475]: slurmctld: _job_complete:
JobId=106 WEXITSTATUS 1
Dec 01 21:47:09 nousheen slurmctld[19475]: slurmctld: _job_complete:
JobId=106 done
Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: sched: Allocate
JobId=107 NodeList=101 #CPUs=8 Partition=debug
Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: sched: Allocate
JobId=108 NodeList=105 #CPUs=8 Partition=debug
Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: sched: Allocate
JobId=109 NodeList=nousheen #CPUs=8 Partition=debug
Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: _job_complete:
JobId=107 WEXITSTATUS 1
Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: _job_complete:
JobId=107 done
Dec 01 21:47:12 nousheen slurmctld[19475]: slurmctld: _job_complete:
JobId=108 WEXITSTATUS 1
Dec 01 21:47:12 nousheen slurmctld[19475]: slurmctld: _job_complete:
JobId=108 done
I have four nodes in total, one of which is the server node. After
submitting a job, the job only runs on my server node while all the
other nodes are IDLE, DOWN, or not responding. The details are given below:
*(base) [nousheen@nousheen slurm]$ scontrol show nodes*
NodeName=101 Arch=x86_64 CoresPerSocket=6
CPUAlloc=0 CPUTot=12 CPULoad=0.01
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=192.168.60.101 NodeHostName=101 Version=21.08.4
OS=Linux 3.10.0-1160.59.1.el7.x86_64 #1 SMP Wed Feb 23 16:47:03 UTC 2022
RealMemory=1 AllocMem=0 FreeMem=641 Sockets=1 Boards=1
State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=debug
BootTime=2022-11-24T11:18:28 SlurmdStartTime=2022-12-01T21:34:57
LastBusyTime=2022-12-02T00:58:31
CfgTRES=cpu=12,mem=1M,billing=12
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
NodeName=104 CoresPerSocket=6
CPUAlloc=0 CPUTot=12 CPULoad=N/A
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=192.168.60.114 NodeHostName=104
RealMemory=1 AllocMem=0 FreeMem=N/A Sockets=1 Boards=1
State=DOWN+NOT_RESPONDING ThreadsPerCore=2 TmpDisk=0 Weight=1
Owner=N/A MCS_label=N/A
Partitions=debug
BootTime=None SlurmdStartTime=None
LastBusyTime=2022-12-01T21:37:35
CfgTRES=cpu=12,mem=1M,billing=12
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Reason=Not responding [slurm@2022-12-01T16:22:28]
NodeName=105 Arch=x86_64 CoresPerSocket=6
CPUAlloc=0 CPUTot=12 CPULoad=1.08
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=192.168.60.115 NodeHostName=105 Version=21.08.4
OS=Linux 3.10.0-1160.76.1.el7.x86_64 #1 SMP Wed Aug 10 16:21:17 UTC 2022
RealMemory=1 AllocMem=0 FreeMem=20723 Sockets=1 Boards=1
State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=debug
BootTime=2022-11-24T11:15:37 SlurmdStartTime=2022-12-01T16:15:30
LastBusyTime=2022-12-01T21:47:11
CfgTRES=cpu=12,mem=1M,billing=12
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
NodeName=nousheen Arch=x86_64 CoresPerSocket=6
CPUAlloc=8 CPUTot=12 CPULoad=6.73
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=192.168.60.149 NodeHostName=nousheen Version=21.08.5
OS=Linux 3.10.0-1160.15.2.el7.x86_64 #1 SMP Wed Feb 3 15:06:38 UTC 2021
RealMemory=1 AllocMem=0 FreeMem=22736 Sockets=1 Boards=1
State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=debug
BootTime=2022-12-01T12:00:08 SlurmdStartTime=2022-12-01T12:00:42
LastBusyTime=2022-12-01T21:37:39
CfgTRES=cpu=12,mem=1M,billing=12
AllocTRES=cpu=8
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Whereas this command shows only the one node on which the job is running:
*(base) [nousheen@nousheen slurm]$ squeue -j*
JOBID PARTITION NAME USER ST TIME NODES
NODELIST(REASON)
109 debug SRBD-4 nousheen R 3:17:48 1 nousheen
Can you please guide me as to why my compute nodes are down and not working?
Thank you for your time.
Best Regards,
Nousheen Parvaiz
On Thu, Dec 1, 2022 at 8:55 PM Michael Robbert <mrobb...@mines.edu> wrote:
I believe that the error you need to pay attention to for this issue
is this line:____
__ __
Dec 01 16:17:19 nousheen slurmctld[1631]: slurmctld: error: Check for
out of sync clocks

It looks like your compute node's clock is a full day ahead of your
controller node's: Dec. 2 instead of Dec. 1. The clocks need to be in
sync for munge to work.
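As a quick check (assuming SSH access from the controller; the
hostnames below are taken from your scontrol output), compare the
clocks directly:

  # print UTC time on the controller, then on each compute node
  date -u
  for h in 101 104 105; do ssh $h date -u; done

On CentOS 7 you can keep the clocks in sync with chrony:

  systemctl enable chronyd && systemctl start chronyd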
*Mike Robbert*
*Cyberinfrastructure Specialist, Cyberinfrastructure and Advanced
Research Computing*
Information and Technology Solutions (ITS)
303-273-3786 | mrobb...@mines.edu
*Our values:* Trust | Integrity | Respect | Responsibility

*From: *slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf
of Nousheen <nousheenparv...@gmail.com>
*Date: *Thursday, December 1, 2022 at 06:19
*To: *Slurm User Community List <slurm-users@lists.schedmd.com>
*Subject: *[External] [slurm-users] ERROR: slurmctld: auth/munge:
_print_cred: DECODED
Hello Everyone,

I am using Slurm version 21.08.5 and CentOS 7.

I successfully started slurmd on all compute nodes, but when I start
slurmctld on the server node it gives the following error:
*(base) [nousheen@nousheen ~]$ systemctl status slurmctld.service -l*
● slurmctld.service - Slurm controller daemon
Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled;
vendor preset: disabled)
Active: active (running) since Thu 2022-12-01 12:00:42 PKT; 4h
16min ago
Main PID: 1631 (slurmctld)
Tasks: 10
Memory: 4.0M
CGroup: /system.slice/slurmctld.service
├─1631 /usr/sbin/slurmctld -D -s
└─1818 slurmctld: slurmscriptd
Dec 01 16:17:19 nousheen slurmctld[1631]: slurmctld: auth/munge:
_print_cred: DECODED: Thu Dec 01 16:17:19 2022
Dec 01 16:17:19 nousheen slurmctld[1631]: slurmctld: error: Check for
out of sync clocks
Dec 01 16:17:20 nousheen slurmctld[1631]: slurmctld: error: Munge
decode failed: Rewound credential
Dec 01 16:17:20 nousheen slurmctld[1631]: slurmctld: auth/munge:
_print_cred: ENCODED: Fri Dec 02 16:16:55 2022
Dec 01 16:17:20 nousheen slurmctld[1631]: slurmctld: auth/munge:
_print_cred: DECODED: Thu Dec 01 16:17:20 2022
Dec 01 16:17:20 nousheen slurmctld[1631]: slurmctld: error: Check for
out of sync clocks
Dec 01 16:17:21 nousheen slurmctld[1631]: slurmctld: error: Munge
decode failed: Rewound credential
Dec 01 16:17:21 nousheen slurmctld[1631]: slurmctld: auth/munge:
_print_cred: ENCODED: Fri Dec 02 16:16:56 2022
Dec 01 16:17:21 nousheen slurmctld[1631]: slurmctld: auth/munge:
_print_cred: DECODED: Thu Dec 01 16:17:21 2022
Dec 01 16:17:21 nousheen slurmctld[1631]: slurmctld: error: Check for
out of sync clocks
When I run the following command on the compute nodes, I get the
following output:
[gpu101@101 ~]$ *munge -n | unmunge*
STATUS: Success (0)
ENCODE_HOST: ??? (0.0.0.101)
ENCODE_TIME: 2022-12-02 16:33:38 +0500 (1669980818)
DECODE_TIME: 2022-12-02 16:33:38 +0500 (1669980818)
TTL: 300
CIPHER: aes128 (4)
MAC: sha1 (3)
ZIP: none (0)
UID: gpu101 (1000)
GID: gpu101 (1000)
LENGTH: 0

Is this error because the ENCODE_HOST name shows question marks and the
IP is also not resolved correctly by munge? How can I correct this? All
the nodes become non-responsive when I run a job. However, I have all
the clocks synced across the cluster.
I am new to Slurm. Kindly guide me in this matter.