Hi, I am trying to run Slurm on Fedora 33. Upon boot the slurmd daemon is running correctly; however, the slurmctld daemon always fails with an error.

```
[admin@localhost ~]$ systemctl status slurmd.service
● slurmd.service - Slurm node daemon
     Loaded: loaded (/etc/systemd/system/slurmd.service; enabled; vendor preset: disabled)
     Active: active (running) since Mon 2020-12-14 16:02:18 PST; 11min ago
   Main PID: 2363 (slurmd)
      Tasks: 2
     Memory: 3.4M
        CPU: 211ms
     CGroup: /system.slice/slurmd.service
             └─2363 /usr/local/sbin/slurmd -D

Dec 14 16:02:18 localhost.localdomain systemd[1]: Started Slurm node daemon.

[admin@localhost ~]$ systemctl status slurmctld.service
● slurmctld.service - Slurm controller daemon
     Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
    Drop-In: /etc/systemd/system/slurmctld.service.d
             └─override.conf
     Active: failed (Result: exit-code) since Mon 2020-12-14 16:02:12 PST; 11min ago
    Process: 1972 ExecStart=/usr/local/sbin/slurmctld -D $SLURMCTLD_OPTIONS (code=exited, status=1/FAILURE)
   Main PID: 1972 (code=exited, status=1/FAILURE)
        CPU: 21ms

Dec 14 16:02:12 localhost.localdomain systemd[1]: Started Slurm controller daemon.
Dec 14 16:02:12 localhost.localdomain systemd[1]: slurmctld.service: Main process exited, code=exited, status=1/FAILURE
Dec 14 16:02:12 localhost.localdomain systemd[1]: slurmctld.service: Failed with result 'exit-code'.
```
The slurmctld log is as follows:

```
[2020-12-14T16:02:12.731] slurmctld version 20.11.1 started on cluster cluster
[2020-12-14T16:02:12.739] No memory enforcing mechanism configured.
[2020-12-14T16:02:12.772] error: get_addr_info: getaddrinfo() failed: Name or service not known
[2020-12-14T16:02:12.772] error: slurm_set_addr: Unable to resolve "localhost"
[2020-12-14T16:02:12.772] error: slurm_get_port: Address family '0' not supported
[2020-12-14T16:02:12.772] error: _set_slurmd_addr: failure on localhost
[2020-12-14T16:02:12.772] Recovered state of 1 nodes
[2020-12-14T16:02:12.772] Recovered information about 0 jobs
[2020-12-14T16:02:12.772] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
[2020-12-14T16:02:12.779] Recovered state of 0 reservations
[2020-12-14T16:02:12.779] read_slurm_conf: backup_controller not specified
[2020-12-14T16:02:12.779] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
[2020-12-14T16:02:12.779] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
[2020-12-14T16:02:12.779] Running as primary controller
[2020-12-14T16:02:12.780] No parameter for mcs plugin, default values set
[2020-12-14T16:02:12.780] mcs: MCSParameters = (null). ondemand set.
[2020-12-14T16:02:12.780] error: get_addr_info: getaddrinfo() failed: Name or service not known
[2020-12-14T16:02:12.780] error: slurm_set_addr: Unable to resolve "(null)"
[2020-12-14T16:02:12.780] error: slurm_set_port: attempting to set port without address family
[2020-12-14T16:02:12.782] error: Error creating slurm stream socket: Address family not supported by protocol
[2020-12-14T16:02:12.782] fatal: slurm_init_msg_engine_port error Address family not supported by protocol
```

Strangely, the daemon works fine when it is restarted manually.
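If I'm reading the log right, the fatal socket error chains back to the very first failure: `getaddrinfo()` cannot resolve "localhost" at that point during boot, so no address family ever gets set for the controller's listening socket. What fails is roughly equivalent to this sketch (port 6817 is just the default `SlurmctldPort`):

```python
import socket

# slurmctld resolves its own hostname ("localhost" here, per ControlMachine)
# via getaddrinfo() before binding its listening socket. If name resolution
# is not yet available when the service starts, getaddrinfo() raises
# socket.gaierror -- matching the "Name or service not known" log lines.
try:
    infos = socket.getaddrinfo("localhost", 6817, type=socket.SOCK_STREAM)
    print("resolved:", infos[0][4][0])  # a loopback address once resolution works
except socket.gaierror as exc:
    print("resolution failed:", exc)
```

Once the system is fully up this resolves fine, which would explain why a manual restart succeeds.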
After running `systemctl restart slurmctld.service` the service status is:

```
[admin@localhost ~]$ systemctl status slurmctld.service
● slurmctld.service - Slurm controller daemon
     Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
    Drop-In: /etc/systemd/system/slurmctld.service.d
             └─override.conf
     Active: active (running) since Mon 2020-12-14 16:14:24 PST; 3s ago
   Main PID: 2815 (slurmctld)
      Tasks: 7
     Memory: 1.9M
        CPU: 15ms
     CGroup: /system.slice/slurmctld.service
             └─2815 /usr/local/sbin/slurmctld -D

Dec 14 16:14:24 localhost.localdomain systemd[1]: Started Slurm controller daemon.
```

Could anyone point me towards how to fix this? I expect it's just an issue with my configuration file, which I've copied below for reference.

```
# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
#SlurmctldHost=localhost
ControlMachine=localhost
#
#MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/home/slurm/run/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/home/slurm/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurm/slurmd/
SlurmUser=slurm
#SlurmdUser=root
StateSaveLocation=/home/slurm/spool/
SwitchType=switch/none
TaskPlugin=task/affinity
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
#
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
ClusterName=cluster
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
#SlurmctldDebug=info
SlurmctldLogFile=/home/slurm/log/slurmctld.log
#SlurmdDebug=info
#SlurmdLogFile=
#
#
# COMPUTE NODES
NodeName=localhost CPUs=128 RealMemory=257682 Sockets=1 CoresPerSocket=64 ThreadsPerCore=2 State=UNKNOWN
PartitionName=full Nodes=localhost Default=YES MaxTime=INFINITE State=UP
```
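One thing I have been wondering about is whether this is a startup-ordering issue rather than a slurm.conf issue, since resolution works once the machine is up. A drop-in like the following is a sketch of what I mean (the targets are standard systemd units, not something from my config; I haven't confirmed this is the right fix):

```ini
# /etc/systemd/system/slurmctld.service.d/override.conf
# Sketch of a drop-in (created via: systemctl edit slurmctld.service)
[Unit]
# Delay startup until hostname resolution / the network is actually up.
Wants=network-online.target
After=network-online.target

[Service]
# Safety net: retry if slurmctld still exits during early boot.
Restart=on-failure
RestartSec=5
```

After saving I would run `systemctl daemon-reload` and reboot to test. Does that sound plausible, or is the configuration file itself the problem?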
Thanks! -John