Hi Nousheen,

It seems that you have configured the nodes incorrectly in slurm.conf. I notice this:

  RealMemory=1

This means 1 megabyte of RAM; we only had that on IBM PCs back in the 1980s :-)
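
RealMemory is the node's physical RAM in megabytes. As a rough sketch, a node line for one of your 12-CPU machines could look like this (the memory figure is a placeholder; take the real one from "slurmd -C" as described below):

  NodeName=101 NodeAddr=192.168.60.101 CPUs=12 Sockets=1 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=31906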

See how to configure nodes in https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#compute-node-configuration

You must run "slurmd -C" on each node to determine its actual hardware.
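
For example, the output will look something like this (the values here are only an illustration):

  NodeName=101 CPUs=12 Boards=1 SocketsPerBoard=1 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=31906
  UpTime=7-03:45:12

The NodeName line (without UpTime) can be pasted straight into slurm.conf; remember to restart slurmctld and the slurmd's afterwards.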

I hope this helps.

/Ole

On 12/1/22 21:08, Nousheen wrote:
Dear Robbert,

Thank you so much for your response. I was so focused on syncing the time that I missed that the date on one of the nodes was one day behind, as you said. I have corrected it, and now I get the following status output:

*(base) [nousheen@nousheen slurm]$ systemctl status slurmctld.service -l*
● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
    Active: active (running) since Thu 2022-12-01 21:37:34 PKT; 20min ago
  Main PID: 19475 (slurmctld)
     Tasks: 10
    Memory: 4.5M
    CGroup: /system.slice/slurmctld.service
            ├─19475 /usr/sbin/slurmctld -D -s
            └─19538 slurmctld: slurmscriptd

Dec 01 21:47:08 nousheen slurmctld[19475]: slurmctld: sched/backfill: _start_job: Started JobId=106 in debug on 101
Dec 01 21:47:09 nousheen slurmctld[19475]: slurmctld: _job_complete: JobId=106 WEXITSTATUS 1
Dec 01 21:47:09 nousheen slurmctld[19475]: slurmctld: _job_complete: JobId=106 done
Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: sched: Allocate JobId=107 NodeList=101 #CPUs=8 Partition=debug
Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: sched: Allocate JobId=108 NodeList=105 #CPUs=8 Partition=debug
Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: sched: Allocate JobId=109 NodeList=nousheen #CPUs=8 Partition=debug
Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: _job_complete: JobId=107 WEXITSTATUS 1
Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: _job_complete: JobId=107 done
Dec 01 21:47:12 nousheen slurmctld[19475]: slurmctld: _job_complete: JobId=108 WEXITSTATUS 1
Dec 01 21:47:12 nousheen slurmctld[19475]: slurmctld: _job_complete: JobId=108 done

I have a total of four nodes, one of which is the server node. After submitting a job, the job runs only on my server node (which is also a compute node) while all the other nodes are IDLE, DOWN, or not responding. The details are given below:

*(base) [nousheen@nousheen slurm]$ scontrol show nodes*
NodeName=101 Arch=x86_64 CoresPerSocket=6
    CPUAlloc=0 CPUTot=12 CPULoad=0.01
    AvailableFeatures=(null)
    ActiveFeatures=(null)
    Gres=(null)
    NodeAddr=192.168.60.101 NodeHostName=101 Version=21.08.4
    OS=Linux 3.10.0-1160.59.1.el7.x86_64 #1 SMP Wed Feb 23 16:47:03 UTC 2022
    RealMemory=1 AllocMem=0 FreeMem=641 Sockets=1 Boards=1
    State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
    Partitions=debug
    BootTime=2022-11-24T11:18:28 SlurmdStartTime=2022-12-01T21:34:57
    LastBusyTime=2022-12-02T00:58:31
    CfgTRES=cpu=12,mem=1M,billing=12
    AllocTRES=
    CapWatts=n/a
    CurrentWatts=0 AveWatts=0
    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

NodeName=104 CoresPerSocket=6
    CPUAlloc=0 CPUTot=12 CPULoad=N/A
    AvailableFeatures=(null)
    ActiveFeatures=(null)
    Gres=(null)
    NodeAddr=192.168.60.114 NodeHostName=104
    RealMemory=1 AllocMem=0 FreeMem=N/A Sockets=1 Boards=1
    State=DOWN+NOT_RESPONDING ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
    Partitions=debug
    BootTime=None SlurmdStartTime=None
    LastBusyTime=2022-12-01T21:37:35
    CfgTRES=cpu=12,mem=1M,billing=12
    AllocTRES=
    CapWatts=n/a
    CurrentWatts=0 AveWatts=0
    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
    Reason=Not responding [slurm@2022-12-01T16:22:28]

NodeName=105 Arch=x86_64 CoresPerSocket=6
    CPUAlloc=0 CPUTot=12 CPULoad=1.08
    AvailableFeatures=(null)
    ActiveFeatures=(null)
    Gres=(null)
    NodeAddr=192.168.60.115 NodeHostName=105 Version=21.08.4
    OS=Linux 3.10.0-1160.76.1.el7.x86_64 #1 SMP Wed Aug 10 16:21:17 UTC 2022
    RealMemory=1 AllocMem=0 FreeMem=20723 Sockets=1 Boards=1
    State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
    Partitions=debug
    BootTime=2022-11-24T11:15:37 SlurmdStartTime=2022-12-01T16:15:30
    LastBusyTime=2022-12-01T21:47:11
    CfgTRES=cpu=12,mem=1M,billing=12
    AllocTRES=
    CapWatts=n/a
    CurrentWatts=0 AveWatts=0
    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

NodeName=nousheen Arch=x86_64 CoresPerSocket=6
    CPUAlloc=8 CPUTot=12 CPULoad=6.73
    AvailableFeatures=(null)
    ActiveFeatures=(null)
    Gres=(null)
    NodeAddr=192.168.60.149 NodeHostName=nousheen Version=21.08.5
    OS=Linux 3.10.0-1160.15.2.el7.x86_64 #1 SMP Wed Feb 3 15:06:38 UTC 2021
    RealMemory=1 AllocMem=0 FreeMem=22736 Sockets=1 Boards=1
    State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
    Partitions=debug
    BootTime=2022-12-01T12:00:08 SlurmdStartTime=2022-12-01T12:00:42
    LastBusyTime=2022-12-01T21:37:39
    CfgTRES=cpu=12,mem=1M,billing=12
    AllocTRES=cpu=8
    CapWatts=n/a
    CurrentWatts=0 AveWatts=0
    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

Whereas this command shows only the one node on which the job is running:

*(base) [nousheen@nousheen slurm]$ squeue -j*
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                109     debug   SRBD-4 nousheen  R    3:17:48      1 nousheen

Can you please guide me as to why my compute nodes are down and not working?

Thank you for your time.


Best Regards,
Nousheen Parvaiz



On Thu, Dec 1, 2022 at 8:55 PM Michael Robbert <mrobb...@mines.edu> wrote:

    I believe that the error you need to pay attention to for this issue
    is this line:

    Dec 01 16:17:19 nousheen slurmctld[1631]: slurmctld: error: Check for
    out of sync clocks

    It looks like your compute node's clock is a full day ahead of your
    controller node: Dec. 2 instead of Dec. 1. The clocks need to be in
    sync for munge to work.
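
    For example, on CentOS 7 you can keep the clocks in sync with the
    stock chrony tools (a sketch; point chrony at your own NTP servers):

        # run on every node in the cluster
        date                                 # quick comparison of wall clocks
        sudo systemctl enable --now chronyd  # start the NTP client
        chronyc tracking                     # verify the offset is near zero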

    *Mike Robbert*

    *Cyberinfrastructure Specialist, Cyberinfrastructure and Advanced
    Research Computing*

    Information and Technology Solutions (ITS)

    303-273-3786 | mrobb...@mines.edu

    *Our values:* Trust | Integrity | Respect | Responsibility

    *From: *slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf
    of Nousheen <nousheenparv...@gmail.com>
    *Date: *Thursday, December 1, 2022 at 06:19
    *To: *Slurm User Community List <slurm-users@lists.schedmd.com>
    *Subject: *[External] [slurm-users] ERROR: slurmctld: auth/munge:
    _print_cred: DECODED

    Hello Everyone,

    I am using Slurm version 21.08.5 and CentOS 7.

    I successfully started slurmd on all compute nodes, but when I start
    slurmctld on the server node it gives the following error:

    *(base) [nousheen@nousheen ~]$ systemctl status slurmctld.service -l*
    ● slurmctld.service - Slurm controller daemon
       Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
       Active: active (running) since Thu 2022-12-01 12:00:42 PKT; 4h 16min ago
     Main PID: 1631 (slurmctld)
        Tasks: 10
       Memory: 4.0M
       CGroup: /system.slice/slurmctld.service
               ├─1631 /usr/sbin/slurmctld -D -s
               └─1818 slurmctld: slurmscriptd

    Dec 01 16:17:19 nousheen slurmctld[1631]: slurmctld: auth/munge:
    _print_cred: DECODED: Thu Dec 01 16:17:19 2022
    Dec 01 16:17:19 nousheen slurmctld[1631]: slurmctld: error: Check for
    out of sync clocks
    Dec 01 16:17:20 nousheen slurmctld[1631]: slurmctld: error: Munge
    decode failed: Rewound credential
    Dec 01 16:17:20 nousheen slurmctld[1631]: slurmctld: auth/munge:
    _print_cred: ENCODED: Fri Dec 02 16:16:55 2022
    Dec 01 16:17:20 nousheen slurmctld[1631]: slurmctld: auth/munge:
    _print_cred: DECODED: Thu Dec 01 16:17:20 2022
    Dec 01 16:17:20 nousheen slurmctld[1631]: slurmctld: error: Check for
    out of sync clocks
    Dec 01 16:17:21 nousheen slurmctld[1631]: slurmctld: error: Munge
    decode failed: Rewound credential
    Dec 01 16:17:21 nousheen slurmctld[1631]: slurmctld: auth/munge:
    _print_cred: ENCODED: Fri Dec 02 16:16:56 2022
    Dec 01 16:17:21 nousheen slurmctld[1631]: slurmctld: auth/munge:
    _print_cred: DECODED: Thu Dec 01 16:17:21 2022
    Dec 01 16:17:21 nousheen slurmctld[1631]: slurmctld: error: Check for
    out of sync clocks

    When I run the following command on the compute nodes, I get the
    following output:

    [gpu101@101 ~]$ *munge -n | unmunge*

    STATUS:           Success (0)
    ENCODE_HOST:      ??? (0.0.0.101)
    ENCODE_TIME:      2022-12-02 16:33:38 +0500 (1669980818)
    DECODE_TIME:      2022-12-02 16:33:38 +0500 (1669980818)
    TTL:              300
    CIPHER:           aes128 (4)
    MAC:              sha1 (3)
    ZIP:              none (0)
    UID:              gpu101 (1000)
    GID:              gpu101 (1000)
    LENGTH:           0

    Is this error because the ENCODE_HOST name shows question marks and
    the IP is also not picked up correctly by munge? How can I correct
    this? All the nodes become non-responding when I run a job. However,
    I have all the clocks synced across the cluster.

    I am new to Slurm. Kindly guide me in this matter.
