I am doing a new install of Slurm 24.05.3. I have all the packages built and
installed on the head node and a compute node, with the same munge.key,
slurm.conf, and gres.conf files on both. I was able to test munge successfully
with the munge and unmunge commands, and time is synced with chronyd. I can't
find any useful errors in the logs, but for some reason when I run sinfo no
nodes are listed; I just see the headers for each column. Has anyone seen this,
or do you know what a good next troubleshooting step would be? I'm new to this
and not sure where to go from here. Thanks for any and all help!



The odd output I am seeing:

[username@headnode ~]$ sinfo
PARTITION AVAIL    TIMELIMIT NODES   STATE   NODELIST

(Nothing is output showing the status of the partition or nodes.)
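For reference, here are a few diagnostic commands commonly suggested as next steps when sinfo comes back empty. These are run on the head node; the node name `k001` is taken from the config below and stands in for any compute node:

```shell
# Node-oriented listing; sometimes shows nodes that the partition view hides
sinfo -N -l

# Show the Reason field for any nodes that are down, drained, or failing
sinfo -R

# Full controller-side state of one node, including State= and Reason=
scontrol show node k001

# Verify the controller itself is up and responding
scontrol ping
```

If `scontrol show node` works while `sinfo` stays empty, the nodes are known to slurmctld and the problem is more likely in the partition definition than in node registration.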





slurm.conf:

ClusterName=slurmkvasir
SlurmctldHost=kadmin2
MpiDefault=none
ProctrackType=proctrack/cgroup
PrologFlags=contain
ReturnToService=2
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
StateSaveLocation=/var/spool/slurmctld
TaskPlugin=task/cgroup
MinJobAge=600
SchedulerType=sched/backfill
SelectType=select/cons_tres
PriorityType=priority/multifactor
AccountingStorageHost=localhost
AccountingStoragePass=/var/run/munge/munge.socket.2
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageTRES=gres/gpu,cpu,node
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/cgroup
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm/slurmd.log
NodeName=k[001-448]
PartitionName=default Nodes=k[001-448] Default=YES MaxTime=INFINITE State=UP
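Not a diagnosis, but one thing worth comparing: the slurmd.log further down reports 2 sockets with 20 cores each per node, while the NodeName line above declares no topology, so Slurm assumes CPUs=1 for every node. A sketch of what a topology-matched line might look like; the figures are assumptions read off the slurmd.log (RealMemory from its Memory value) and should be verified with `slurmd -C` on an actual compute node:

```
NodeName=k[001-448] Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=1 RealMemory=192030
```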



slurmctld.log:

Error: Configured MailProg is invalid
Slurmctld version 24.05.3 started on cluster slurmkvasir
Accounting_storage/slurmdbd: clusteracct_storage_p_register_ctld: Registering slurmctld at port 8617
Error: read_slurm_conf: default partition not set.
Recovered state of 448 nodes
Down nodes: k[002-448]
Recovered information about 0 jobs
Recovered state of 0 reservations
Read_slurm_conf: backup_controller not specified
Select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
Running as primary controller



slurmd.log:

Error: Node configuration differs from hardware: CPUs=1:40(hw) Boards=1:1(hw) SocketsPerBoard=1:2(hw) CoresPerSocket=1:20(hw) ThreadsPerCore=1:1(hw)
CPU frequency setting not configured for this node
Slurmd version 24.05.3 started
Slurmd started on Wed, 27 Nov 2024 06:51:03 -0700
CPUs=1 Boards=1 Cores=1 Threads=1 Memory=192030 TmpDisk=95201 Uptime=166740 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
Error: _forward_thread: failed to k019 (10.142.0.119:6818): Connection timed out

(The line above is repeated 20 or so times for different nodes.)
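For the "Node configuration differs from hardware" error above, the slurmd man page documents a -C flag that prints the node's detected hardware as a ready-made slurm.conf line; comparing its output against the NodeName entry in slurm.conf is a common first check. The sample output shown in the comments is illustrative of the format only, not taken from these nodes:

```shell
# On a compute node: print the detected topology in slurm.conf format and exit
slurmd -C
# Typical output shape:
# NodeName=k001 CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 \
#   ThreadsPerCore=1 RealMemory=192030
```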



Thanks,

Kent Hanson

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
