Then try using the IP of the controller node as explained here https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmctldAddr or here https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmctldHost.
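For example, if the head node's cluster-facing hostname is "head" and its IP is 10.0.0.1 (placeholders, substitute your own values), the slurm.conf entry would look something like

SlurmctldHost=head(10.0.0.1)

so that slurmd on the compute node reaches the controller at that address instead of resolving a name like "localhost" locally.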
Also, if you look at the first few lines of /etc/hosts (just above the line that reads
### ALL ENTRIES BELOW THIS LINE WILL BE OVERWRITTEN BY WAREWULF ###), you should see a
hostname for the head node and its IP address. If that entry isn't set correctly, the
slurmd daemon won't know how to reach the slurmctld daemon. (There is also a quick
sketch of the log-directory setup I suggested earlier at the bottom of this message.)

Kind Regards,
Glen

==========================================
Glen MacLachlan, PhD
Lead High Performance Computing Engineer

Research Technology Services
The George Washington University
44983 Knoll Square
Enterprise Hall, 328L
Ashburn, VA 20147

==========================================


On Thu, Dec 8, 2022 at 3:06 PM Jeffrey Layton <layto...@gmail.com> wrote:

> localhost is the ctrl name :)
>
> I can change it though if needed (I was lazy when I did the initial
> installation).
>
> Thanks!
>
> Jeff
>
>
> On Thu, Dec 8, 2022 at 2:30 PM Glen MacLachlan <macl...@gwu.edu> wrote:
>
>> One other thing to address is that SlurmctldHost should point to the
>> controller node where slurmctld is running, the name of which I would
>> expect Warewulf to have put into /etc/hosts.
>> https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmctldHost
>>
>> Kind Regards,
>> Glen
>>
>> ==========================================
>> Glen MacLachlan, PhD
>> Lead High Performance Computing Engineer
>>
>> Research Technology Services
>> The George Washington University
>> 44983 Knoll Square
>> Enterprise Hall, 328L
>> Ashburn, VA 20147
>>
>> ==========================================
>>
>>
>> On Thu, Dec 8, 2022 at 1:58 PM Glen MacLachlan <macl...@gwu.edu> wrote:
>>
>>> What does running this on the compute node show? (It looks at the
>>> journal log for the past 12 hours.)
>>>
>>> journalctl -S -12h -o verbose | grep slurm
>>>
>>> You may want to increase your debug verbosity to debug5 while tracking
>>> down this issue; for reference, see
>>> https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdDebug
>>>
>>> You should also address this error to fix logging:
>>>
>>> [2022-12-08T13:12:17.343] error: chdir(/var/log): Permission denied
>>>
>>> by making a directory /var/log/slurm and making the slurm user its owner
>>> on both the controller and the compute node. Then update your slurm.conf
>>> file like this:
>>>
>>> # LOGGING
>>> SlurmctldLogFile=/var/log/slurm/slurmctld.log
>>> SlurmdLogFile=/var/log/slurm/slurmd.log
>>>
>>> and then run 'scontrol reconfigure'.
>>>
>>> Kind Regards,
>>> Glen
>>>
>>> ==========================================
>>> Glen MacLachlan, PhD
>>> Lead High Performance Computing Engineer
>>>
>>> Research Technology Services
>>> The George Washington University
>>> 44983 Knoll Square
>>> Enterprise Hall, 328L
>>> Ashburn, VA 20147
>>>
>>> ==========================================
>>>
>>>
>>> On Thu, Dec 8, 2022 at 1:41 PM Jeffrey Layton <layto...@gmail.com>
>>> wrote:
>>>
>>>> Good afternoon,
>>>>
>>>> I have a very simple two-node cluster using Warewulf 4.3. I was
>>>> following some instructions on how to install the OpenHPC Slurm binaries
>>>> (server and client). I booted the compute node and the Slurm server says
>>>> it's in an unknown state. This hasn't happened to me before, but I would
>>>> like to debug the problem.
>>>>
>>>> I checked the services on the Slurm server (head node):
>>>>
>>>> $ systemctl status munge
>>>> ● munge.service - MUNGE authentication service
>>>>    Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; vendor preset: disabled)
>>>>    Active: active (running) since Thu 2022-12-08 13:12:10 EST; 4min 42s ago
>>>>      Docs: man:munged(8)
>>>>   Process: 1140 ExecStart=/usr/sbin/munged (code=exited, status=0/SUCCESS)
>>>>  Main PID: 1182 (munged)
>>>>     Tasks: 4 (limit: 48440)
>>>>    Memory: 1.2M
>>>>    CGroup: /system.slice/munge.service
>>>>            └─1182 /usr/sbin/munged
>>>>
>>>> Dec 08 13:12:10 localhost.localdomain systemd[1]: Starting MUNGE authentication service...
>>>> Dec 08 13:12:10 localhost.localdomain systemd[1]: Started MUNGE authentication service.
>>>>
>>>> $ systemctl status slurmctld
>>>> ● slurmctld.service - Slurm controller daemon
>>>>    Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
>>>>    Active: active (running) since Thu 2022-12-08 13:12:17 EST; 4min 56s ago
>>>>  Main PID: 1518 (slurmctld)
>>>>     Tasks: 10
>>>>    Memory: 23.0M
>>>>    CGroup: /system.slice/slurmctld.service
>>>>            ├─1518 /usr/sbin/slurmctld -D -s
>>>>            └─1555 slurmctld: slurmscriptd
>>>>
>>>> Dec 08 13:12:17 localhost.localdomain systemd[1]: Started Slurm controller daemon.
>>>> Dec 08 13:12:17 localhost.localdomain slurmctld[1518]: slurmctld: No parameter for mcs plugin, de>
>>>> Dec 08 13:12:17 localhost.localdomain slurmctld[1518]: slurmctld: mcs: MCSParameters = (null). on>
>>>> Dec 08 13:13:17 localhost.localdomain slurmctld[1518]: slurmctld: SchedulerParameters=default_que>
>>>>
>>>> I then booted the compute node and checked the services there:
>>>>
>>>> systemctl status munge
>>>> ● munge.service - MUNGE authentication service
>>>>    Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; vendor preset: disabled)
>>>>    Active: active (running) since Thu 2022-12-08 18:14:53 UTC; 3min 24s ago
>>>>      Docs: man:munged(8)
>>>>   Process: 786 ExecStart=/usr/sbin/munged (code=exited, status=0/SUCCESS)
>>>>  Main PID: 804 (munged)
>>>>     Tasks: 4 (limit: 26213)
>>>>    Memory: 940.0K
>>>>    CGroup: /system.slice/munge.service
>>>>            └─804 /usr/sbin/munged
>>>>
>>>> Dec 08 18:14:53 n0001 systemd[1]: Starting MUNGE authentication service...
>>>> Dec 08 18:14:53 n0001 systemd[1]: Started MUNGE authentication service.
>>>>
>>>> systemctl status slurmd
>>>> ● slurmd.service - Slurm node daemon
>>>>    Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
>>>>    Active: failed (Result: exit-code) since Thu 2022-12-08 18:15:53 UTC; 2min 40s ago
>>>>   Process: 897 ExecStart=/usr/sbin/slurmd -D -s $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
>>>>  Main PID: 897 (code=exited, status=1/FAILURE)
>>>>
>>>> Dec 08 18:15:44 n0001 systemd[1]: Started Slurm node daemon.
>>>> Dec 08 18:15:53 n0001 systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAIL>
>>>> Dec 08 18:15:53 n0001 systemd[1]: slurmd.service: Failed with result 'exit-code'.
>>>>
>>>> I restarted slurmd on the compute node and checked it again:
>>>>
>>>> # systemctl status slurmd
>>>> ● slurmd.service - Slurm node daemon
>>>>    Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
>>>>    Active: active (running) since Thu 2022-12-08 18:19:04 UTC; 5s ago
>>>>  Main PID: 996 (slurmd)
>>>>     Tasks: 2
>>>>    Memory: 1012.0K
>>>>    CGroup: /system.slice/slurmd.service
>>>>            ├─996 /usr/sbin/slurmd -D -s --conf-server localhost
>>>>            └─997 /usr/sbin/slurmd -D -s --conf-server localhost
>>>>
>>>> Dec 08 18:19:04 n0001 systemd[1]: Started Slurm node daemon.
>>>>
>>>> On the Slurm server I checked the queue and "sinfo -a" and found the following:
>>>>
>>>> $ squeue
>>>>              JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
>>>> $ sinfo -a
>>>> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
>>>> normal*      up 1-00:00:00      1   unk* n0001
>>>>
>>>> After a few moments (less than a minute, maybe 20-30 seconds), slurmd on the
>>>> compute node fails. When I checked the service, I saw this:
>>>>
>>>> $ systemctl status slurmd
>>>> ● slurmd.service - Slurm node daemon
>>>>    Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
>>>>    Active: failed (Result: exit-code) since Thu 2022-12-08 18:19:13 UTC; 10min ago
>>>>   Process: 996 ExecStart=/usr/sbin/slurmd -D -s $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
>>>>  Main PID: 996 (code=exited, status=1/FAILURE)
>>>>
>>>> Dec 08 18:19:04 n0001 systemd[1]: Started Slurm node daemon.
>>>> Dec 08 18:19:13 n0001 systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAIL>
>>>> Dec 08 18:19:13 n0001 systemd[1]: slurmd.service: Failed with result 'exit-code'.
>>>>
>>>> Below are the logs from the Slurm server for today (I rebooted the compute node twice):
>>>>
>>>> [2022-12-08T13:12:17.343] error: chdir(/var/log): Permission denied
>>>> [2022-12-08T13:12:17.343] error: Configured MailProg is invalid
>>>> [2022-12-08T13:12:17.347] slurmctld version 22.05.2 started on cluster cluster
>>>> [2022-12-08T13:12:17.371] No memory enforcing mechanism configured.
>>>> [2022-12-08T13:12:17.374] Recovered state of 1 nodes
>>>> [2022-12-08T13:12:17.374] Recovered JobId=3 Assoc=0
>>>> [2022-12-08T13:12:17.374] Recovered JobId=4 Assoc=0
>>>> [2022-12-08T13:12:17.374] Recovered information about 2 jobs
>>>> [2022-12-08T13:12:17.375] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
>>>> [2022-12-08T13:12:17.375] Recovered state of 0 reservations
>>>> [2022-12-08T13:12:17.375] read_slurm_conf: backup_controller not specified
>>>> [2022-12-08T13:12:17.376] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
>>>> [2022-12-08T13:12:17.376] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
>>>> [2022-12-08T13:12:17.376] Running as primary controller
>>>> [2022-12-08T13:12:17.376] No parameter for mcs plugin, default values set
>>>> [2022-12-08T13:12:17.376] mcs: MCSParameters = (null). ondemand set.
>>>> [2022-12-08T13:13:17.471] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
>>>> [2022-12-08T13:17:17.940] error: Nodes n0001 not responding
>>>> [2022-12-08T13:22:17.533] error: Nodes n0001 not responding
>>>> [2022-12-08T13:27:17.048] error: Nodes n0001 not responding
>>>>
>>>> There are no logs on the compute node.
>>>>
>>>> Any suggestions where to start looking?
>>>> I think I'm seeing the trees and not the forest :)
>>>>
>>>> Thanks!
>>>>
>>>> Jeff
>>>>
>>>> P.S. Here are some relevant parts of the server slurm.conf:
>>>>
>>>> # slurm.conf file generated by configurator.html.
>>>> # Put this file on all nodes of your cluster.
>>>> # See the slurm.conf man page for more information.
>>>> #
>>>> ClusterName=cluster
>>>> SlurmctldHost=localhost
>>>> #SlurmctldHost=
>>>> ...
>>>>
>>>> And here are the relevant parts of slurm.conf on the client node:
>>>>
>>>> # slurm.conf file generated by configurator.html.
>>>> # Put this file on all nodes of your cluster.
>>>> # See the slurm.conf man page for more information.
>>>> #
>>>> ClusterName=cluster
>>>> SlurmctldHost=localhost
>>>> #SlurmctldHost=
>>>> ...
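
P.S. To make the log-directory suggestion from my earlier message concrete, here is a
minimal sketch. It assumes the daemons run under a 'slurm' account (as in the advice
quoted above) and that the same path is used on the head node and inside the compute
node image; adjust the user and group to match your installation:

# create the log directory and give the slurm user ownership
mkdir -p /var/log/slurm
chown slurm:slurm /var/log/slurm

Then add the LOGGING lines quoted above to slurm.conf on both nodes and run
'scontrol reconfigure'.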