What does running this on the compute node show? (It looks at the journal log for the past 12 hours.)

journalctl -S -12h -o verbose | grep slurm
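If grepping the whole journal is too noisy, filtering by unit shows the same information in a tighter form. This is just a variation on the command above, assuming the unit names slurmd and munge from your status output below:

journalctl -u slurmd -u munge -S -12h --no-pager
journalctl -u slurmd -b --no-pager   # or: everything from slurmd since the node last booted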
You may want to increase your debug verbosity to debug5 while tracking down this issue. For reference, see https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdDebug

You should also address this error to fix logging:

[2022-12-08T13:12:17.343] error: chdir(/var/log): Permission denied

by creating a directory /var/log/slurm and making the slurm user its owner on both the controller and the compute node. Then update your slurm.conf file like this:

# LOGGING
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.log

and then run 'scontrol reconfigure'. (A command-line sketch of these steps is at the bottom of this message, below the quoted text.)

Kind Regards,
Glen

==========================================
Glen MacLachlan, PhD
Lead High Performance Computing Engineer
Research Technology Services
The George Washington University
44983 Knoll Square
Enterprise Hall, 328L
Ashburn, VA 20147
==========================================

On Thu, Dec 8, 2022 at 1:41 PM Jeffrey Layton <layto...@gmail.com> wrote:

> Good afternoon,
>
> I have a very simple two-node cluster using Warewulf 4.3. I was following
> some instructions on how to install the OpenHPC Slurm binaries (server and
> client). I booted the compute node and the Slurm server says it's in an
> unknown state. This hasn't happened to me before but I would like to debug
> the problem.
>
> I checked the services on the Slurm server (head node):
>
> $ systemctl status munge
> ● munge.service - MUNGE authentication service
>    Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; vendor preset: disabled)
>    Active: active (running) since Thu 2022-12-08 13:12:10 EST; 4min 42s ago
>      Docs: man:munged(8)
>   Process: 1140 ExecStart=/usr/sbin/munged (code=exited, status=0/SUCCESS)
>  Main PID: 1182 (munged)
>     Tasks: 4 (limit: 48440)
>    Memory: 1.2M
>    CGroup: /system.slice/munge.service
>            └─1182 /usr/sbin/munged
>
> Dec 08 13:12:10 localhost.localdomain systemd[1]: Starting MUNGE authentication service...
> Dec 08 13:12:10 localhost.localdomain systemd[1]: Started MUNGE authentication service.
>
> $ systemctl status slurmctld
> ● slurmctld.service - Slurm controller daemon
>    Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
>    Active: active (running) since Thu 2022-12-08 13:12:17 EST; 4min 56s ago
>  Main PID: 1518 (slurmctld)
>     Tasks: 10
>    Memory: 23.0M
>    CGroup: /system.slice/slurmctld.service
>            ├─1518 /usr/sbin/slurmctld -D -s
>            └─1555 slurmctld: slurmscriptd
>
> Dec 08 13:12:17 localhost.localdomain systemd[1]: Started Slurm controller daemon.
> Dec 08 13:12:17 localhost.localdomain slurmctld[1518]: slurmctld: No parameter for mcs plugin, de>
> Dec 08 13:12:17 localhost.localdomain slurmctld[1518]: slurmctld: mcs: MCSParameters = (null). on>
> Dec 08 13:13:17 localhost.localdomain slurmctld[1518]: slurmctld: SchedulerParameters=default_que>
>
> I then booted the compute node and checked the services there:
>
> $ systemctl status munge
> ● munge.service - MUNGE authentication service
>    Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; vendor preset: disabled)
>    Active: active (running) since Thu 2022-12-08 18:14:53 UTC; 3min 24s ago
>      Docs: man:munged(8)
>   Process: 786 ExecStart=/usr/sbin/munged (code=exited, status=0/SUCCESS)
>  Main PID: 804 (munged)
>     Tasks: 4 (limit: 26213)
>    Memory: 940.0K
>    CGroup: /system.slice/munge.service
>            └─804 /usr/sbin/munged
>
> Dec 08 18:14:53 n0001 systemd[1]: Starting MUNGE authentication service...
> Dec 08 18:14:53 n0001 systemd[1]: Started MUNGE authentication service.
>
> $ systemctl status slurmd
> ● slurmd.service - Slurm node daemon
>    Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
>    Active: failed (Result: exit-code) since Thu 2022-12-08 18:15:53 UTC; 2min 40s ago
>   Process: 897 ExecStart=/usr/sbin/slurmd -D -s $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
>  Main PID: 897 (code=exited, status=1/FAILURE)
>
> Dec 08 18:15:44 n0001 systemd[1]: Started Slurm node daemon.
> Dec 08 18:15:53 n0001 systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAIL>
> Dec 08 18:15:53 n0001 systemd[1]: slurmd.service: Failed with result 'exit-code'.
>
> # systemctl status slurmd
> ● slurmd.service - Slurm node daemon
>    Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
>    Active: active (running) since Thu 2022-12-08 18:19:04 UTC; 5s ago
>  Main PID: 996 (slurmd)
>     Tasks: 2
>    Memory: 1012.0K
>    CGroup: /system.slice/slurmd.service
>            ├─996 /usr/sbin/slurmd -D -s --conf-server localhost
>            └─997 /usr/sbin/slurmd -D -s --conf-server localhost
>
> Dec 08 18:19:04 n0001 systemd[1]: Started Slurm node daemon.
>
> On the Slurm server I checked the queue and "sinfo -a" and found the following:
>
> $ squeue
>   JOBID PARTITION  NAME  USER ST  TIME  NODES NODELIST(REASON)
> $ sinfo -a
> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
> normal*      up 1-00:00:00      1   unk* n0001
>
> After a few moments (less than a minute, maybe 20-30 seconds), slurmd on
> the compute node fails. When I checked the service I saw this:
>
> $ systemctl status slurmd
> ● slurmd.service - Slurm node daemon
>    Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
>    Active: failed (Result: exit-code) since Thu 2022-12-08 18:19:13 UTC; 10min ago
>   Process: 996 ExecStart=/usr/sbin/slurmd -D -s $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
>  Main PID: 996 (code=exited, status=1/FAILURE)
>
> Dec 08 18:19:04 n0001 systemd[1]: Started Slurm node daemon.
> Dec 08 18:19:13 n0001 systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAIL>
> Dec 08 18:19:13 n0001 systemd[1]: slurmd.service: Failed with result 'exit-code'.
>
> Below are the logs for the Slurm server for today (I rebooted the compute node twice):
>
> [2022-12-08T13:12:17.343] error: chdir(/var/log): Permission denied
> [2022-12-08T13:12:17.343] error: Configured MailProg is invalid
> [2022-12-08T13:12:17.347] slurmctld version 22.05.2 started on cluster cluster
> [2022-12-08T13:12:17.371] No memory enforcing mechanism configured.
> [2022-12-08T13:12:17.374] Recovered state of 1 nodes
> [2022-12-08T13:12:17.374] Recovered JobId=3 Assoc=0
> [2022-12-08T13:12:17.374] Recovered JobId=4 Assoc=0
> [2022-12-08T13:12:17.374] Recovered information about 2 jobs
> [2022-12-08T13:12:17.375] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
> [2022-12-08T13:12:17.375] Recovered state of 0 reservations
> [2022-12-08T13:12:17.375] read_slurm_conf: backup_controller not specified
> [2022-12-08T13:12:17.376] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
> [2022-12-08T13:12:17.376] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
> [2022-12-08T13:12:17.376] Running as primary controller
> [2022-12-08T13:12:17.376] No parameter for mcs plugin, default values set
> [2022-12-08T13:12:17.376] mcs: MCSParameters = (null). ondemand set.
> [2022-12-08T13:13:17.471] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
> [2022-12-08T13:17:17.940] error: Nodes n0001 not responding
> [2022-12-08T13:22:17.533] error: Nodes n0001 not responding
> [2022-12-08T13:27:17.048] error: Nodes n0001 not responding
>
> There are no logs on the compute node.
>
> Any suggestions where to start looking? I think I'm seeing the trees and not the forest :)
>
> Thanks!
>
> Jeff
>
> P.S. Here are some relevant parts of the server's slurm.conf:
>
> # slurm.conf file generated by configurator.html.
> # Put this file on all nodes of your cluster.
> # See the slurm.conf man page for more information.
> #
> ClusterName=cluster
> SlurmctldHost=localhost
> #SlurmctldHost=
> ...
>
> Here are some relevant parts of slurm.conf on the client node:
>
> # slurm.conf file generated by configurator.html.
> # Put this file on all nodes of your cluster.
> # See the slurm.conf man page for more information.
> #
> ClusterName=cluster
> SlurmctldHost=localhost
> #SlurmctldHost=
> ...
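P.S. To make the logging fix above concrete, a rough sketch of the commands would be something like this (the 'slurm' user and group are an assumption; adjust the ownership to match whatever account your Slurm daemons run as, i.e. your SlurmUser):

# on both the controller and the compute node, as root:
mkdir -p /var/log/slurm
chown slurm:slurm /var/log/slurm   # 'slurm:slurm' is an assumption; match your SlurmUser
chmod 755 /var/log/slurm

# add to slurm.conf on all nodes:
SlurmdDebug=debug5
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.log

# then, on the controller:
scontrol reconfigure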