Hi Sven,

I think it makes more sense to adjust the config file /etc/slurm-llnl/slurm.conf and not the systemd units:

SlurmctldPidFile=/run/slurmctld.pid
SlurmdPidFile=/run/slurmd.pid

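If it helps, here is a minimal sketch of how that change could be scripted. It assumes GNU sed and grep (as on Ubuntu 20.04) and that the keys are written as plain `Key=value` lines without surrounding whitespace; the helper name `set_pidfiles` is made up for illustration and is not part of Slurm.

```shell
#!/bin/sh
# Sketch only: point the two PID file settings in a slurm.conf at the
# locations the packaged systemd units look for.
# set_pidfiles is a hypothetical helper, not a Slurm tool.

# set_pidfiles CONF: rewrite (or append) both PID file keys in CONF.
set_pidfiles() {
    conf="$1"
    for pair in "SlurmctldPidFile=/run/slurmctld.pid" \
                "SlurmdPidFile=/run/slurmd.pid"; do
        key="${pair%%=*}"
        if grep -qi "^${key}=" "$conf"; then
            # Key already present: replace the whole line (case-insensitive match).
            sed -i "s|^${key}=.*|${pair}|I" "$conf"
        else
            # Key missing: append it at the end of the file.
            printf '%s\n' "$pair" >> "$conf"
        fi
    done
}

# Possible usage as root, keeping a backup first:
#   cp /etc/slurm-llnl/slurm.conf /etc/slurm-llnl/slurm.conf.bak
#   set_pidfiles /etc/slurm-llnl/slurm.conf
#   systemctl show -p PIDFile slurmctld.service   # path the unit expects
#   systemctl restart slurmctld
```

The `systemctl show -p PIDFile` call is only there to confirm that the path in slurm.conf and the path in the unit file agree before restarting.
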
Best,
Stefan

On Wednesday, 17 March 2021, 19:16:38 CET, Sven Duscha wrote:
> Hi,
>
> I experience with SLURM slurmctld an error on Ubuntu20.04, when starting
> the service (through systemctl):
>
> I installed munge and SLURM version 19.05.5-1 through the package
> manager from the default repository:
>
> apt-get install munge slurm-client slurm-wlm slurm-wlm-doc slurmctld slurmd
>
> systemctl start slurmctld &
> [1] 2735
> 18:55 [root@slurm ~]# systemctl status slurmctld
> ● slurmctld.service - Slurm controller daemon
>     Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled;
>             vendor preset: enabled)
>     Active: activating (start) since Wed 2021-03-17 18:55:49 CET; 5s ago
>     Docs: man:slurmctld(8)
>     Process: 2737 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS
>             (code=exited, status=0/SUCCESS)
>     Tasks: 12
>     Memory: 2.5M
>     CGroup: /system.slice/slurmctld.service
>             └─2759 /usr/sbin/slurmctld
>
> Mar 17 18:55:49 slurm systemd[1]: Starting Slurm controller daemon...
> Mar 17 18:55:49 slurm systemd[1]: slurmctld.service: Can't open PID file
> /run/slurmctld.pid (yet?) after start: Operation not permitted
>
> After about 60 seconds slurmctld terminates:
>
> -- A stop job for unit slurmctld.service has finished.
> --
> -- The job identifier is 1043 and the job result is done.
> Mar 17 18:55:49 slurm systemd[1]: Starting Slurm controller daemon...
> -- Subject: A start job for unit slurmctld.service has begun execution
> -- Defined-By: systemd
> -- Support: http://www.ubuntu.com/support
> --
> -- A start job for unit slurmctld.service has begun execution.
> --
> -- The job identifier is 1044.
> Mar 17 18:55:49 slurm systemd[1]: slurmctld.service: Can't open PID file
> /run/slurmctld.pid (yet?) after start: Operation not permitted
> Mar 17 18:57:19 slurm systemd[1]: slurmctld.service: start operation
> timed out. Terminating.
> Mar 17 18:57:19 slurm systemd[1]: slurmctld.service: Failed with result
> 'timeout'.
>
> My slurm.conf file lists custom PID file locations for slurmctld and slurmd:
> /etc/slurm-llnl/slurm.conf
>
> SlurmctldPidFile=/run/slurm-llnl/slurmctld.pid
> SlurmdPidFile=/run/slurm-llnl/slurmd.pid
>
> Starting the slurmctld executable by hand works fine:
> /usr/sbin/slurmctld &
>
> pgrep slurmctld
> 2819
> [1]+ Done /usr/sbin/slurmctld
> pgrep slurmctld
> 2819
> squeue
> JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
> sinfo -lNe
> Wed Mar 17 19:01:45 2021
> NODELIST NODES PARTITION STATE    CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
> ekgen1   1     cluster*  unknown* 16   2:8:1 480000 0        1      (null)   none
> ekgen2   1     cluster*  down*    16   2:8:1 250000 0        1      (null)   Not responding
> ekgen3   1     debian    unknown* 16   2:8:1 250000 0        1      (null)   none
> ekgen4   1     cluster*  unknown* 16   2:8:1 250000 0        1      (null)   none
> ekgen5   1     cluster*  unknown* 16   2:8:1 250000 0        1      (null)   none
> ekgen6   1     debian    unknown* 16   2:8:1 250000 0        1      (null)   none
> ekgen7   1     cluster*  unknown* 16   2:8:1 250000 0        1      (null)   none
> ekgen8   1     debian    down*    16   2:8:1 250000 0        1      (null)   Not responding
> ekgen9   1     cluster*  unknown* 16   2:8:1 192000 0        1      (null)   none
>
> I tried then to modify /lib/systemd/system/slurmd.service
>
> cp /lib/systemd/system/slurmd.service /lib/systemd/system/slurmd.service.orig
>
> changed
> PIDFile=/run/slurmd.pid
> to
> PIDFile=/run/slurm-llnl/slurmd.pid
>
> systemctl start slurmctld &
> [1] 1869
> pgrep slurm
> 1875
> squeue
> JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
>
> after ca. 60 seconds:
>
> Job for slurmctld.service failed because a timeout was exceeded.
> See "systemctl status slurmctld.service" and "journalctl -xe" for details
>
> - Subject: A start job for unit packagekit.service has finished successfully
> -- Defined-By: systemd
> -- Support: http://www.ubuntu.com/support
> --
> -- A start job for unit packagekit.service has finished successfully.
> --
> -- The job identifier is 586.
> Mar 17 18:28:08 slurm systemd[1]: slurmctld.service: start operation
> timed out. Terminating.
> Mar 17 18:28:08 slurm systemd[1]: slurmctld.service: Failed with result
> 'timeout'.
> -- Subject: Unit failed
> -- Defined-By: systemd
> -- Support: http://www.ubuntu.com/support
> --
> -- The unit slurmctld.service has entered the 'failed' state with result
> 'timeout'.
> Mar 17 18:28:08 slurm systemd[1]: Failed to start Slurm controller daemon.
> -- Subject: A start job for unit slurmctld.service has failed
> -- Defined-By: systemd
> -- Support: http://www.ubuntu.com/support
> --
> -- A start job for unit slurmctld.service has finished with a failure.
> --
> -- The job identifier is 511 and the job result is failed.
> Mar 17 18:31:18 slurm systemd[1]: Starting Slurm controller daemon...
> -- Subject: A start job for unit slurmctld.service has begun execution
> -- Defined-By: systemd
> -- Support: http://www.ubuntu.com/support
> --
> -- A start job for unit slurmctld.service has begun execution.
> --
> -- The job identifier is 662.
> Mar 17 18:31:18 slurm systemd[1]: slurmctld.service: Can't open PID file
> /run/slurm-lnll/slurmctld.pid (yet?) after start: Operation not permitted
> Mar 17 18:32:48 slurm systemd[1]: slurmctld.service: start operation
> timed out. Terminating.
> Mar 17 18:32:48 slurm systemd[1]: slurmctld.service: Failed with result
> 'timeout'.
> -- Subject: Unit failed
> -- Defined-By: systemd
> -- Support: http://www.ubuntu.com/support
>
> mkdir /run/slurm-lnll/
> chown slurm: /run/slurm-lnll/
>
> ls -lthrd /run/slurm-lnll/
> drwxr-xr-x 2 slurm slurm 40 Mar 17 18:34 /run/slurm-lnll/
>
> It doesn't create the PID file
>
> ls -lthr /run/slurm-lnll/
> total 0
>
> A work-around, writing the PID manually to the PID file, does work:
>
> systemctl start slurmctld & sleep 10; echo `pgrep slurmctld` > /run/slurm-lnll/slurmctld.pid && chown slurm: /run/slurm-lnll/slurmctld.pid && cat /run/slurm-lnll/slurmctld.pid
>
> Still status problem reported:
>
> systemctl status slurmctld
> ● slurmctld.service - Slurm controller daemon
>     Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled;
>             vendor preset: enabled)
>     Active: active (running) since Wed 2021-03-17 18:37:28 CET; 1min 4s ago
>     Docs: man:slurmctld(8)
>     Process: 2272 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS
>             (code=exited, status=0/SUCCESS)
>     Main PID: 2287 (slurmctld)
>     Tasks: 7
>     Memory: 2.3M
>     CGroup: /system.slice/slurmctld.service
>             └─2287 /usr/sbin/slurmctld
>
> Mar 17 18:37:18 slurm systemd[1]: Starting Slurm controller daemon...
> Mar 17 18:37:18 slurm systemd[1]: slurmctld.service: Can't open PID file
> /run/slurm-lnll/slurmctld.pid (yet?) after start: Operation not permitted
> Mar 17 18:37:28 slurm systemd[1]: Started Slurm controller daemon.
>
> But the slurmctld process doesn't crash anymore.
> Stopping the service does work:
>
> systemctl stop slurmctld.service
> systemctl status slurmctld
> ● slurmctld.service - Slurm controller daemon
>     Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled;
>             vendor preset: enabled)
>     Active: inactive (dead) since Wed 2021-03-17 18:50:47 CET; 1s ago
>     Docs: man:slurmctld(8)
>     Process: 2272 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS
>             (code=exited, status=0/SUCCESS)
>     Main PID: 2287 (code=exited, status=0/SUCCESS)
>
> Mar 17 18:37:18 slurm systemd[1]: Starting Slurm controller daemon...
> Mar 17 18:37:18 slurm systemd[1]: slurmctld.service: Can't open PID file
> /run/slurm-lnll/slurmctld.pid (yet?) after start: Operation not permitted
> Mar 17 18:37:28 slurm systemd[1]: Started Slurm controller daemon.
> Mar 17 18:50:47 slurm systemd[1]: Stopping Slurm controller daemon...
> Mar 17 18:50:47 slurm systemd[1]: slurmctld.service: Succeeded.
> Mar 17 18:50:47 slurm systemd[1]: Stopped Slurm controller daemon.
>
> I am a little astonished that the default package shows this strange
> behaviour regarding slurmctld installed through the package manager.
>
> The base installation is Ubuntu 20.04 server installation, where I did
> no modifications apart from installing the SLURM-wlm packages and
> importing my existing configuration and munge.key.
>
> Best wishes,
>
> Sven Duscha

--
Stefan Stäglich, Universität Freiburg, Institut für Informatik
Georges-Köhler-Allee, Geb.52, 79110 Freiburg, Germany

E-Mail : staeg...@informatik.uni-freiburg.de
WWW    : gki.informatik.uni-freiburg.de
Phone  : +49 761 203-54216
Fax    : +49 761 203-8222