Thanks Glen! I changed the slurm.conf logging to "debug5" on both the server and the client.
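In case it matters, the logging-related lines in slurm.conf now look roughly like this on both machines (log paths per your suggestion, and I bumped both debug levels):

# LOGGING
SlurmctldDebug=debug5
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=debug5
SlurmdLogFile=/var/log/slurm/slurmd.log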
I also created /var/log/slurm on both the client and server and chown-ed it to slurm:slurm. On the server I ran "scontrol reconfigure". Then I rebooted the compute node. When I logged in, slurmd was not up, so I ran "systemctl start slurmd". It stayed up for about 5 seconds and then stopped.

# systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Thu 2022-12-08 19:51:58 UTC; 2min 33s ago
  Process: 1299 ExecStart=/usr/sbin/slurmd -D -s $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
 Main PID: 1299 (code=exited, status=1/FAILURE)

Dec 08 19:51:49 n0001 systemd[1]: Started Slurm node daemon.
Dec 08 19:51:58 n0001 systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAIL>
Dec 08 19:51:58 n0001 systemd[1]: slurmd.service: Failed with result 'exit-code'.

Here is the output from grepping through journalctl:

UNIT=slurmd.service
SYSLOG_IDENTIFIER=slurmd
_COMM=slurmd
SYSLOG_IDENTIFIER=slurmd
_COMM=slurmd
SYSLOG_IDENTIFIER=slurmd
_COMM=slurmd
MESSAGE=error: slurmd initialization failed
UNIT=slurmd.service
MESSAGE=slurmd.service: Main process exited, code=exited, status=1/FAILURE
UNIT=slurmd.service
MESSAGE=slurmd.service: Failed with result 'exit-code'.
MESSAGE=Operator of unix-process:911:7771 successfully authenticated as unix-user:root to gain ONE-SHOT authorization for action org.freedesktop.systemd1.manage-units for system-bus-name::1.24 [systemctl start slurmd] (owned by unix-user:laytonjb)
UNIT=slurmd.service
SYSLOG_IDENTIFIER=slurmd
_COMM=slurmd
SYSLOG_IDENTIFIER=slurmd
_COMM=slurmd
SYSLOG_IDENTIFIER=slurmd
_COMM=slurmd
MESSAGE=error: slurmd initialization failed
UNIT=slurmd.service
MESSAGE=slurmd.service: Main process exited, code=exited, status=1/FAILURE
UNIT=slurmd.service
MESSAGE=slurmd.service: Failed with result 'exit-code'.
UNIT=slurmd.service
SYSLOG_IDENTIFIER=slurmd
_COMM=slurmd
SYSLOG_IDENTIFIER=slurmd
_COMM=slurmd
SYSLOG_IDENTIFIER=slurmd
_COMM=slurmd
MESSAGE=error: slurmd initialization failed
UNIT=slurmd.service
MESSAGE=slurmd.service: Main process exited, code=exited, status=1/FAILURE
UNIT=slurmd.service
MESSAGE=slurmd.service: Failed with result 'exit-code'.
MESSAGE=Operator of unix-process:1254:240421 successfully authenticated as unix-user:root to gain ONE-SHOT authorization for action org.freedesktop.systemd1.manage-units for system-bus-name::1.47 [systemctl start slurmd] (owned by unix-user:laytonjb)
UNIT=slurmd.service
SYSLOG_IDENTIFIER=slurmd
_COMM=slurmd
SYSLOG_IDENTIFIER=slurmd
_COMM=slurmd
SYSLOG_IDENTIFIER=slurmd
_COMM=slurmd
MESSAGE=error: slurmd initialization failed
UNIT=slurmd.service
MESSAGE=slurmd.service: Main process exited, code=exited, status=1/FAILURE
UNIT=slurmd.service
MESSAGE=slurmd.service: Failed with result 'exit-code'.
UNIT=slurmd.service
SYSLOG_IDENTIFIER=slurmd
_COMM=slurmd
SYSLOG_IDENTIFIER=slurmd
_COMM=slurmd
SYSLOG_IDENTIFIER=slurmd
_COMM=slurmd
MESSAGE=error: slurmd initialization failed
UNIT=slurmd.service
MESSAGE=slurmd.service: Main process exited, code=exited, status=1/FAILURE
UNIT=slurmd.service
MESSAGE=slurmd.service: Failed with result 'exit-code'.

These don't look too useful even with debug5 on.
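Unless there is a better idea, my next step is probably to run slurmd by hand in the foreground on the compute node with extra verbosity so whatever is failing during init gets printed straight to the terminal, something like:

# systemctl stop slurmd
# /usr/sbin/slurmd -D -vvvvv

(-D keeps it in the foreground and copies messages to stderr, and the -v flags add verbosity on top of the debug5 setting.)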
Any thoughts?

Thanks!

Jeff

On Thu, Dec 8, 2022 at 2:01 PM Glen MacLachlan <macl...@gwu.edu> wrote:

> What does running this on the compute node show? (looks at the journal log for the past 12 hours)
>
> journalctl -S -12h -o verbose | grep slurm
>
> You may want to increase your debug verbosity to debug5 while tracking down this issue. For reference, see
> https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdDebug
>
> You should also address this error to fix logging:
>
> [2022-12-08T13:12:17.343] error: chdir(/var/log): Permission denied
>
> by making a directory /var/log/slurm and making the slurm user the owner on both the controller and compute node. Then update your slurm.conf file like this:
>
> # LOGGING
> SlurmctldLogFile=/var/log/slurm/slurmctld.log
> SlurmdLogFile=/var/log/slurm/slurmd.log
>
> and then run 'scontrol reconfigure'.
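> Concretely, the directory step is just something like this on each node (adjust the owner if your slurm user/group is named differently):
>
> # mkdir -p /var/log/slurm
> # chown slurm:slurm /var/log/slurm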
>
> Kind Regards,
> Glen
>
> ==========================================
> Glen MacLachlan, PhD
> *Lead High Performance Computing Engineer*
>
> Research Technology Services
> The George Washington University
> 44983 Knoll Square
> Enterprise Hall, 328L
> Ashburn, VA 20147
>
> ==========================================
>
> On Thu, Dec 8, 2022 at 1:41 PM Jeffrey Layton <layto...@gmail.com> wrote:
>
>> Good afternoon,
>>
>> I have a very simple two node cluster using Warewulf 4.3. I was following some instructions on how to install the OpenHPC Slurm binaries (server and client). I booted the compute node and the Slurm server says the node is in an unknown state. This hasn't happened to me before, but I would like to debug the problem.
>>
>> I checked the services on the Slurm server (head node):
>>
>> $ systemctl status munge
>> ● munge.service - MUNGE authentication service
>>    Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; vendor preset: disabled)
>>    Active: active (running) since Thu 2022-12-08 13:12:10 EST; 4min 42s ago
>>      Docs: man:munged(8)
>>   Process: 1140 ExecStart=/usr/sbin/munged (code=exited, status=0/SUCCESS)
>>  Main PID: 1182 (munged)
>>     Tasks: 4 (limit: 48440)
>>    Memory: 1.2M
>>    CGroup: /system.slice/munge.service
>>            └─1182 /usr/sbin/munged
>>
>> Dec 08 13:12:10 localhost.localdomain systemd[1]: Starting MUNGE authentication service...
>> Dec 08 13:12:10 localhost.localdomain systemd[1]: Started MUNGE authentication service.
>>
>> $ systemctl status slurmctld
>> ● slurmctld.service - Slurm controller daemon
>>    Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
>>    Active: active (running) since Thu 2022-12-08 13:12:17 EST; 4min 56s ago
>>  Main PID: 1518 (slurmctld)
>>     Tasks: 10
>>    Memory: 23.0M
>>    CGroup: /system.slice/slurmctld.service
>>            ├─1518 /usr/sbin/slurmctld -D -s
>>            └─1555 slurmctld: slurmscriptd
>>
>> Dec 08 13:12:17 localhost.localdomain systemd[1]: Started Slurm controller daemon.
>> Dec 08 13:12:17 localhost.localdomain slurmctld[1518]: slurmctld: No parameter for mcs plugin, de>
>> Dec 08 13:12:17 localhost.localdomain slurmctld[1518]: slurmctld: mcs: MCSParameters = (null). on>
>> Dec 08 13:13:17 localhost.localdomain slurmctld[1518]: slurmctld: SchedulerParameters=default_que>
>>
>> I then booted the compute node and checked the services there:
>>
>> systemctl status munge
>> ● munge.service - MUNGE authentication service
>>    Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; vendor preset: disabled)
>>    Active: active (running) since Thu 2022-12-08 18:14:53 UTC; 3min 24s ago
>>      Docs: man:munged(8)
>>   Process: 786 ExecStart=/usr/sbin/munged (code=exited, status=0/SUCCESS)
>>  Main PID: 804 (munged)
>>     Tasks: 4 (limit: 26213)
>>    Memory: 940.0K
>>    CGroup: /system.slice/munge.service
>>            └─804 /usr/sbin/munged
>>
>> Dec 08 18:14:53 n0001 systemd[1]: Starting MUNGE authentication service...
>> Dec 08 18:14:53 n0001 systemd[1]: Started MUNGE authentication service.
>>
>> systemctl status slurmd
>> ● slurmd.service - Slurm node daemon
>>    Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
>>    Active: failed (Result: exit-code) since Thu 2022-12-08 18:15:53 UTC; 2min 40s ago
>>   Process: 897 ExecStart=/usr/sbin/slurmd -D -s $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
>>  Main PID: 897 (code=exited, status=1/FAILURE)
>>
>> Dec 08 18:15:44 n0001 systemd[1]: Started Slurm node daemon.
>> Dec 08 18:15:53 n0001 systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAIL>
>> Dec 08 18:15:53 n0001 systemd[1]: slurmd.service: Failed with result 'exit-code'.
>>
>> I started slurmd again by hand and it came up:
>>
>> # systemctl status slurmd
>> ● slurmd.service - Slurm node daemon
>>    Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
>>    Active: active (running) since Thu 2022-12-08 18:19:04 UTC; 5s ago
>>  Main PID: 996 (slurmd)
>>     Tasks: 2
>>    Memory: 1012.0K
>>    CGroup: /system.slice/slurmd.service
>>            ├─996 /usr/sbin/slurmd -D -s --conf-server localhost
>>            └─997 /usr/sbin/slurmd -D -s --conf-server localhost
>>
>> Dec 08 18:19:04 n0001 systemd[1]: Started Slurm node daemon.
>>
>> On the Slurm server I checked the queue and "sinfo -a" and found the following:
>>
>> $ squeue
>>   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
>> $ sinfo -a
>> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
>> normal*      up 1-00:00:00      1   unk* n0001
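>>
>> In case it helps, I can also post the full node record from the controller, e.g.:
>>
>> $ scontrol show node n0001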
>>
>> After a few moments (less than a minute, maybe 20-30 seconds), slurmd on the compute node fails. When I checked the service I saw this:
>>
>> $ systemctl status slurmd
>> ● slurmd.service - Slurm node daemon
>>    Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
>>    Active: failed (Result: exit-code) since Thu 2022-12-08 18:19:13 UTC; 10min ago
>>   Process: 996 ExecStart=/usr/sbin/slurmd -D -s $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
>>  Main PID: 996 (code=exited, status=1/FAILURE)
>>
>> Dec 08 18:19:04 n0001 systemd[1]: Started Slurm node daemon.
>> Dec 08 18:19:13 n0001 systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAIL>
>> Dec 08 18:19:13 n0001 systemd[1]: slurmd.service: Failed with result 'exit-code'.
>>
>> Below are the logs for the Slurm server for today (I rebooted the compute node twice):
>>
>> [2022-12-08T13:12:17.343] error: chdir(/var/log): Permission denied
>> [2022-12-08T13:12:17.343] error: Configured MailProg is invalid
>> [2022-12-08T13:12:17.347] slurmctld version 22.05.2 started on cluster cluster
>> [2022-12-08T13:12:17.371] No memory enforcing mechanism configured.
>> [2022-12-08T13:12:17.374] Recovered state of 1 nodes
>> [2022-12-08T13:12:17.374] Recovered JobId=3 Assoc=0
>> [2022-12-08T13:12:17.374] Recovered JobId=4 Assoc=0
>> [2022-12-08T13:12:17.374] Recovered information about 2 jobs
>> [2022-12-08T13:12:17.375] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
>> [2022-12-08T13:12:17.375] Recovered state of 0 reservations
>> [2022-12-08T13:12:17.375] read_slurm_conf: backup_controller not specified
>> [2022-12-08T13:12:17.376] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
>> [2022-12-08T13:12:17.376] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
>> [2022-12-08T13:12:17.376] Running as primary controller
>> [2022-12-08T13:12:17.376] No parameter for mcs plugin, default values set
>> [2022-12-08T13:12:17.376] mcs: MCSParameters = (null). ondemand set.
>> [2022-12-08T13:13:17.471] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
>> [2022-12-08T13:17:17.940] error: Nodes n0001 not responding
>> [2022-12-08T13:22:17.533] error: Nodes n0001 not responding
>> [2022-12-08T13:27:17.048] error: Nodes n0001 not responding
>>
>> There are no logs on the compute node.
>>
>> Any suggestions where to start looking? I think I'm seeing the trees and not the forest :)
>>
>> Thanks!
>>
>> Jeff
>>
>> P.S. Here are some relevant features from the server slurm.conf:
>>
>> # slurm.conf file generated by configurator.html.
>> # Put this file on all nodes of your cluster.
>> # See the slurm.conf man page for more information.
>> #
>> ClusterName=cluster
>> SlurmctldHost=localhost
>> #SlurmctldHost=
>> ...
>>
>> Here are some relevant parts of slurm.conf on the client node:
>>
>> # slurm.conf file generated by configurator.html.
>> # Put this file on all nodes of your cluster.
>> # See the slurm.conf man page for more information.
>> #
>> ClusterName=cluster
>> SlurmctldHost=localhost
>> #SlurmctldHost=
>> ...