Good afternoon, I have a very simple two-node cluster running Warewulf 4.3. I was following some instructions on how to install the OpenHPC Slurm binaries (server and client). After booting the compute node, the Slurm server reports the node in an unknown state. This hasn't happened to me before, and I would like to debug the problem.
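For context, the install itself was roughly along these lines (paraphrasing the recipe I followed; the package names are the OpenHPC meta-packages, "rocky-8" is just a placeholder for my compute container name, and the exact wwctl invocation may differ on your version):

# On the head node (controller side):
dnf -y install ohpc-slurm-server

# In the compute image (client side), installed inside the Warewulf container
# and followed by an image rebuild:
wwctl container exec rocky-8 -- dnf -y install ohpc-slurm-client
wwctl container build rocky-8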
I checked the services on the Slurm server (head node):

$ systemctl status munge
● munge.service - MUNGE authentication service
   Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2022-12-08 13:12:10 EST; 4min 42s ago
     Docs: man:munged(8)
  Process: 1140 ExecStart=/usr/sbin/munged (code=exited, status=0/SUCCESS)
 Main PID: 1182 (munged)
    Tasks: 4 (limit: 48440)
   Memory: 1.2M
   CGroup: /system.slice/munge.service
           └─1182 /usr/sbin/munged

Dec 08 13:12:10 localhost.localdomain systemd[1]: Starting MUNGE authentication service...
Dec 08 13:12:10 localhost.localdomain systemd[1]: Started MUNGE authentication service.

$ systemctl status slurmctld
● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2022-12-08 13:12:17 EST; 4min 56s ago
 Main PID: 1518 (slurmctld)
    Tasks: 10
   Memory: 23.0M
   CGroup: /system.slice/slurmctld.service
           ├─1518 /usr/sbin/slurmctld -D -s
           └─1555 slurmctld: slurmscriptd

Dec 08 13:12:17 localhost.localdomain systemd[1]: Started Slurm controller daemon.
Dec 08 13:12:17 localhost.localdomain slurmctld[1518]: slurmctld: No parameter for mcs plugin, de>
Dec 08 13:12:17 localhost.localdomain slurmctld[1518]: slurmctld: mcs: MCSParameters = (null). on>
Dec 08 13:13:17 localhost.localdomain slurmctld[1518]: slurmctld: SchedulerParameters=default_que>

I then booted the compute node and checked the services there:

# systemctl status munge
● munge.service - MUNGE authentication service
   Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2022-12-08 18:14:53 UTC; 3min 24s ago
     Docs: man:munged(8)
  Process: 786 ExecStart=/usr/sbin/munged (code=exited, status=0/SUCCESS)
 Main PID: 804 (munged)
    Tasks: 4 (limit: 26213)
   Memory: 940.0K
   CGroup: /system.slice/munge.service
           └─804 /usr/sbin/munged

Dec 08 18:14:53 n0001 systemd[1]: Starting MUNGE authentication service...
Dec 08 18:14:53 n0001 systemd[1]: Started MUNGE authentication service.

# systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Thu 2022-12-08 18:15:53 UTC; 2min 40s ago
  Process: 897 ExecStart=/usr/sbin/slurmd -D -s $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
 Main PID: 897 (code=exited, status=1/FAILURE)

Dec 08 18:15:44 n0001 systemd[1]: Started Slurm node daemon.
Dec 08 18:15:53 n0001 systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAIL>
Dec 08 18:15:53 n0001 systemd[1]: slurmd.service: Failed with result 'exit-code'.

slurmd had already failed, so I restarted it and checked again:

# systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2022-12-08 18:19:04 UTC; 5s ago
 Main PID: 996 (slurmd)
    Tasks: 2
   Memory: 1012.0K
   CGroup: /system.slice/slurmd.service
           ├─996 /usr/sbin/slurmd -D -s --conf-server localhost
           └─997 /usr/sbin/slurmd -D -s --conf-server localhost

Dec 08 18:19:04 n0001 systemd[1]: Started Slurm node daemon.

On the Slurm server I checked the queue and "sinfo -a" and found the following:

$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)

$ sinfo -a
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
normal*      up 1-00:00:00      1   unk* n0001

After a few moments (less than a minute, maybe 20-30 seconds), slurmd on the compute node fails.
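Both munge daemons are running, but one thing I haven't verified yet is that the keys actually match across the nodes. A quick cross-node check I plan to run (this assumes root ssh between the head node and n0001):

# Generate a credential on the head node and decode it on the compute node;
# if the keys match, unmunge should report STATUS: Success.
munge -n | ssh n0001 unmunge

# And the reverse direction, decoding a compute-node credential on the head node:
ssh n0001 munge -n | unmunge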
When I checked the slurmd service after the failure, I saw this:

# systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Thu 2022-12-08 18:19:13 UTC; 10min ago
  Process: 996 ExecStart=/usr/sbin/slurmd -D -s $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
 Main PID: 996 (code=exited, status=1/FAILURE)

Dec 08 18:19:04 n0001 systemd[1]: Started Slurm node daemon.
Dec 08 18:19:13 n0001 systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAIL>
Dec 08 18:19:13 n0001 systemd[1]: slurmd.service: Failed with result 'exit-code'.

Below are the slurmctld logs from the Slurm server for today (I rebooted the compute node twice):

[2022-12-08T13:12:17.343] error: chdir(/var/log): Permission denied
[2022-12-08T13:12:17.343] error: Configured MailProg is invalid
[2022-12-08T13:12:17.347] slurmctld version 22.05.2 started on cluster cluster
[2022-12-08T13:12:17.371] No memory enforcing mechanism configured.
[2022-12-08T13:12:17.374] Recovered state of 1 nodes
[2022-12-08T13:12:17.374] Recovered JobId=3 Assoc=0
[2022-12-08T13:12:17.374] Recovered JobId=4 Assoc=0
[2022-12-08T13:12:17.374] Recovered information about 2 jobs
[2022-12-08T13:12:17.375] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
[2022-12-08T13:12:17.375] Recovered state of 0 reservations
[2022-12-08T13:12:17.375] read_slurm_conf: backup_controller not specified
[2022-12-08T13:12:17.376] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
[2022-12-08T13:12:17.376] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
[2022-12-08T13:12:17.376] Running as primary controller
[2022-12-08T13:12:17.376] No parameter for mcs plugin, default values set
[2022-12-08T13:12:17.376] mcs: MCSParameters = (null). ondemand set.
[2022-12-08T13:13:17.471] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
[2022-12-08T13:17:17.940] error: Nodes n0001 not responding
[2022-12-08T13:22:17.533] error: Nodes n0001 not responding
[2022-12-08T13:27:17.048] error: Nodes n0001 not responding

There are no logs on the compute node. Any suggestions on where to start looking? I think I'm seeing the trees and not the forest :)

Thanks!
Jeff

P.S. Here are some relevant parts of slurm.conf on the server:

# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ClusterName=cluster
SlurmctldHost=localhost
#SlurmctldHost=
...

And here are the relevant parts of slurm.conf on the client (compute) node:

# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ClusterName=cluster
SlurmctldHost=localhost
#SlurmctldHost=
...
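P.P.S. If it helps, my next step will probably be to run slurmd by hand on the compute node so the failure reason isn't lost, and to ask the controller for its view of the node. Something like this (flags per the slurmd and sinfo man pages):

# On the compute node: stop the unit, then run slurmd in the foreground
# with verbose logging to see why it exits with status 1.
systemctl stop slurmd
/usr/sbin/slurmd -D -vvv

# On the head node: the controller's record of the node, plus any recorded
# reasons for nodes being down or drained.
scontrol show node n0001
sinfo -R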