localhost is the controller name :) I can change it if needed, though (I was lazy when I did the initial installation).
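If I do rename it, I'm guessing the change looks roughly like this (untested sketch; "head" is just a placeholder for whatever real hostname I end up giving the controller):

    # On the controller: give it a real name and make sure both nodes resolve it
    # (Warewulf should be carrying /etc/hosts to the compute node via its overlay).
    hostnamectl set-hostname head
    getent hosts head

    # In slurm.conf (pushed to all nodes):
    SlurmctldHost=head

    # On the compute node image: point slurmd at the controller instead of localhost.
    # I think the --conf-server value comes from SLURMD_OPTIONS, probably set in
    # /etc/sysconfig/slurmd in the OpenHPC packaging.
    SLURMD_OPTIONS="--conf-server head"

    # Then, on the head node:
    scontrol reconfigure

    # And on n0001:
    systemctl restart slurmd

For the logging error you pointed out, I'll do something like this on both the controller and the compute node image before adding the SlurmctldLogFile/SlurmdLogFile lines and bumping SlurmdDebug to debug5:

    mkdir -p /var/log/slurm
    chown slurm:slurm /var/log/slurm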
Thanks!

Jeff

On Thu, Dec 8, 2022 at 2:30 PM Glen MacLachlan <macl...@gwu.edu> wrote:

> One other thing to address is that SlurmctldHost should point to the
> controller node where slurmctld is running, the name of which I would
> expect Warewulf to put into /etc/hosts.
> https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmctldHost
>
> Kind Regards,
> Glen
>
> ==========================================
> Glen MacLachlan, PhD
> *Lead High Performance Computing Engineer*
>
> Research Technology Services
> The George Washington University
> 44983 Knoll Square
> Enterprise Hall, 328L
> Ashburn, VA 20147
>
> ==========================================
>
>
> On Thu, Dec 8, 2022 at 1:58 PM Glen MacLachlan <macl...@gwu.edu> wrote:
>
>> What does running this on the compute node show? (It looks at the
>> journal log for the past 12 hours.)
>>
>> journalctl -S -12h -o verbose | grep slurm
>>
>> You may want to increase your debug verbosity to debug5 while tracking
>> down this issue. For reference, see
>> https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdDebug
>>
>> You should also address this error to fix logging:
>>
>> [2022-12-08T13:12:17.343] error: chdir(/var/log): Permission denied
>>
>> by making a directory /var/log/slurm and making the slurm user the owner
>> on both the controller and compute node. Then update your slurm.conf file
>> like this:
>>
>> # LOGGING
>> SlurmctldLogFile=/var/log/slurm/slurmctld.log
>> SlurmdLogFile=/var/log/slurm/slurmd.log
>>
>> and then run 'scontrol reconfigure'.
>>
>> Kind Regards,
>> Glen
>>
>> ==========================================
>> Glen MacLachlan, PhD
>> *Lead High Performance Computing Engineer*
>>
>> Research Technology Services
>> The George Washington University
>> 44983 Knoll Square
>> Enterprise Hall, 328L
>> Ashburn, VA 20147
>>
>> ==========================================
>>
>>
>> On Thu, Dec 8, 2022 at 1:41 PM Jeffrey Layton <layto...@gmail.com> wrote:
>>
>>> Good afternoon,
>>>
>>> I have a very simple two-node cluster using Warewulf 4.3. I was
>>> following some instructions on how to install the OpenHPC Slurm binaries
>>> (server and client). I booted the compute node and the Slurm server says
>>> it's in an unknown state. This hasn't happened to me before, but I would
>>> like to debug the problem.
>>>
>>> I checked the services on the Slurm server (head node):
>>>
>>> $ systemctl status munge
>>> ● munge.service - MUNGE authentication service
>>>    Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; vendor preset: disabled)
>>>    Active: active (running) since Thu 2022-12-08 13:12:10 EST; 4min 42s ago
>>>      Docs: man:munged(8)
>>>   Process: 1140 ExecStart=/usr/sbin/munged (code=exited, status=0/SUCCESS)
>>>  Main PID: 1182 (munged)
>>>     Tasks: 4 (limit: 48440)
>>>    Memory: 1.2M
>>>    CGroup: /system.slice/munge.service
>>>            └─1182 /usr/sbin/munged
>>>
>>> Dec 08 13:12:10 localhost.localdomain systemd[1]: Starting MUNGE authentication service...
>>> Dec 08 13:12:10 localhost.localdomain systemd[1]: Started MUNGE authentication service.
>>>
>>> $ systemctl status slurmctld
>>> ● slurmctld.service - Slurm controller daemon
>>>    Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
>>>    Active: active (running) since Thu 2022-12-08 13:12:17 EST; 4min 56s ago
>>>  Main PID: 1518 (slurmctld)
>>>     Tasks: 10
>>>    Memory: 23.0M
>>>    CGroup: /system.slice/slurmctld.service
>>>            ├─1518 /usr/sbin/slurmctld -D -s
>>>            └─1555 slurmctld: slurmscriptd
>>>
>>> Dec 08 13:12:17 localhost.localdomain systemd[1]: Started Slurm controller daemon.
>>> Dec 08 13:12:17 localhost.localdomain slurmctld[1518]: slurmctld: No parameter for mcs plugin, de>
>>> Dec 08 13:12:17 localhost.localdomain slurmctld[1518]: slurmctld: mcs: MCSParameters = (null). on>
>>> Dec 08 13:13:17 localhost.localdomain slurmctld[1518]: slurmctld: SchedulerParameters=default_que>
>>>
>>> I then booted the compute node and checked the services there:
>>>
>>> systemctl status munge
>>> ● munge.service - MUNGE authentication service
>>>    Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; vendor preset: disabled)
>>>    Active: active (running) since Thu 2022-12-08 18:14:53 UTC; 3min 24s ago
>>>      Docs: man:munged(8)
>>>   Process: 786 ExecStart=/usr/sbin/munged (code=exited, status=0/SUCCESS)
>>>  Main PID: 804 (munged)
>>>     Tasks: 4 (limit: 26213)
>>>    Memory: 940.0K
>>>    CGroup: /system.slice/munge.service
>>>            └─804 /usr/sbin/munged
>>>
>>> Dec 08 18:14:53 n0001 systemd[1]: Starting MUNGE authentication service...
>>> Dec 08 18:14:53 n0001 systemd[1]: Started MUNGE authentication service.
>>>
>>> systemctl status slurmd
>>> ● slurmd.service - Slurm node daemon
>>>    Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
>>>    Active: failed (Result: exit-code) since Thu 2022-12-08 18:15:53 UTC; 2min 40s ago
>>>   Process: 897 ExecStart=/usr/sbin/slurmd -D -s $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
>>>  Main PID: 897 (code=exited, status=1/FAILURE)
>>>
>>> Dec 08 18:15:44 n0001 systemd[1]: Started Slurm node daemon.
>>> Dec 08 18:15:53 n0001 systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAIL>
>>> Dec 08 18:15:53 n0001 systemd[1]: slurmd.service: Failed with result 'exit-code'.
>>>
>>> # systemctl status slurmd
>>> ● slurmd.service - Slurm node daemon
>>>    Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
>>>    Active: active (running) since Thu 2022-12-08 18:19:04 UTC; 5s ago
>>>  Main PID: 996 (slurmd)
>>>     Tasks: 2
>>>    Memory: 1012.0K
>>>    CGroup: /system.slice/slurmd.service
>>>            ├─996 /usr/sbin/slurmd -D -s --conf-server localhost
>>>            └─997 /usr/sbin/slurmd -D -s --conf-server localhost
>>>
>>> Dec 08 18:19:04 n0001 systemd[1]: Started Slurm node daemon.
>>>
>>> On the Slurm server I checked the queue and ran "sinfo -a", and found the following:
>>>
>>> $ squeue
>>>   JOBID PARTITION  NAME  USER ST  TIME  NODES NODELIST(REASON)
>>> $ sinfo -a
>>> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
>>> normal*      up 1-00:00:00      1   unk* n0001
>>>
>>> After a few moments (less than a minute, maybe 20-30 seconds), slurmd on
>>> the compute node fails.
>>> When I checked the service, I saw this:
>>>
>>> $ systemctl status slurmd
>>> ● slurmd.service - Slurm node daemon
>>>    Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
>>>    Active: failed (Result: exit-code) since Thu 2022-12-08 18:19:13 UTC; 10min ago
>>>   Process: 996 ExecStart=/usr/sbin/slurmd -D -s $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
>>>  Main PID: 996 (code=exited, status=1/FAILURE)
>>>
>>> Dec 08 18:19:04 n0001 systemd[1]: Started Slurm node daemon.
>>> Dec 08 18:19:13 n0001 systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAIL>
>>> Dec 08 18:19:13 n0001 systemd[1]: slurmd.service: Failed with result 'exit-code'.
>>>
>>> Below are the logs from the Slurm server for today (I rebooted the
>>> compute node twice):
>>>
>>> [2022-12-08T13:12:17.343] error: chdir(/var/log): Permission denied
>>> [2022-12-08T13:12:17.343] error: Configured MailProg is invalid
>>> [2022-12-08T13:12:17.347] slurmctld version 22.05.2 started on cluster cluster
>>> [2022-12-08T13:12:17.371] No memory enforcing mechanism configured.
>>> [2022-12-08T13:12:17.374] Recovered state of 1 nodes
>>> [2022-12-08T13:12:17.374] Recovered JobId=3 Assoc=0
>>> [2022-12-08T13:12:17.374] Recovered JobId=4 Assoc=0
>>> [2022-12-08T13:12:17.374] Recovered information about 2 jobs
>>> [2022-12-08T13:12:17.375] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
>>> [2022-12-08T13:12:17.375] Recovered state of 0 reservations
>>> [2022-12-08T13:12:17.375] read_slurm_conf: backup_controller not specified
>>> [2022-12-08T13:12:17.376] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
>>> [2022-12-08T13:12:17.376] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
>>> [2022-12-08T13:12:17.376] Running as primary controller
>>> [2022-12-08T13:12:17.376] No parameter for mcs plugin, default values set
>>> [2022-12-08T13:12:17.376] mcs: MCSParameters = (null). ondemand set.
>>> [2022-12-08T13:13:17.471] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
>>> [2022-12-08T13:17:17.940] error: Nodes n0001 not responding
>>> [2022-12-08T13:22:17.533] error: Nodes n0001 not responding
>>> [2022-12-08T13:27:17.048] error: Nodes n0001 not responding
>>>
>>> There are no logs on the compute node.
>>>
>>> Any suggestions on where to start looking? I think I'm seeing the trees
>>> and not the forest :)
>>>
>>> Thanks!
>>>
>>> Jeff
>>>
>>> P.S. Here are some relevant parts of the server slurm.conf:
>>>
>>> # slurm.conf file generated by configurator.html.
>>> # Put this file on all nodes of your cluster.
>>> # See the slurm.conf man page for more information.
>>> #
>>> ClusterName=cluster
>>> SlurmctldHost=localhost
>>> #SlurmctldHost=
>>> ...
>>>
>>> Here are some relevant parts of slurm.conf on the client node:
>>>
>>> # slurm.conf file generated by configurator.html.
>>> # Put this file on all nodes of your cluster.
>>> # See the slurm.conf man page for more information.
>>> #
>>> ClusterName=cluster
>>> SlurmctldHost=localhost
>>> #SlurmctldHost=
>>> ...