I am guessing you aren't overly familiar with Linux/systemd since you
have the '&' at the end of your start command.
Be that as it may, you can see it is a permissions issue. Check
permissions on /run and ensure the slurmctld user is able to write there.
You can either change the slurmctld user to one that can write there or
change the permissions on the directory to allow the slurmctld user
write access.
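A rough sketch of such a check follows. It assumes GNU coreutils stat; the helper names describe_dir and can_write are made up for illustration, and the user name slurm and the directory in the example are placeholders to adapt to your setup.

```shell
#!/bin/sh
# describe_dir DIR: print "owner:group mode path" for DIR (GNU stat assumed).
describe_dir() {
    stat -c '%U:%G %a %n' "$1"
}

# can_write DIR: succeed if DIR exists and the *current* user can write to it.
can_write() {
    [ -d "$1" ] && [ -w "$1" ]
}

# Example: inspect /run, then test as the daemon user (run as root;
# "slurm" and the directory are placeholders):
#   describe_dir /run
#   sudo -u slurm test -w /run/slurm-llnl && echo writable || echo "not writable"
```

Running the write test as the daemon user rather than as root matters, since root can usually write everywhere and would hide the problem.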
Brian Andrus
On 3/17/2021 11:16 AM, Sven Duscha wrote:
Hi,
I am getting an error from SLURM's slurmctld on Ubuntu 20.04 when starting
the service through systemctl.
I installed munge and SLURM version 19.05.5-1 through the package
manager from the default repository:
apt-get install munge slurm-client slurm-wlm slurm-wlm-doc slurmctld slurmd
systemctl start slurmctld &
[1] 2735
18:55 [root@slurm ~]# systemctl status slurmctld
● slurmctld.service - Slurm controller daemon
Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled;
vendor preset: enabled)
Active: activating (start) since Wed 2021-03-17 18:55:49 CET; 5s ago
Docs: man:slurmctld(8)
Process: 2737 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS
(code=exited, status=0/SUCCESS)
Tasks: 12
Memory: 2.5M
CGroup: /system.slice/slurmctld.service
└─2759 /usr/sbin/slurmctld
Mar 17 18:55:49 slurm systemd[1]: Starting Slurm controller daemon...
Mar 17 18:55:49 slurm systemd[1]: slurmctld.service: Can't open PID file
/run/slurmctld.pid (yet?) after start: Operation not permitted
After about 60 seconds slurmctld terminates:
-- A stop job for unit slurmctld.service has finished.
--
-- The job identifier is 1043 and the job result is done.
Mar 17 18:55:49 slurm systemd[1]: Starting Slurm controller daemon...
-- Subject: A start job for unit slurmctld.service has begun execution
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- A start job for unit slurmctld.service has begun execution.
--
-- The job identifier is 1044.
Mar 17 18:55:49 slurm systemd[1]: slurmctld.service: Can't open PID file
/run/slurmctld.pid (yet?) after start: Operation not permitted
Mar 17 18:57:19 slurm systemd[1]: slurmctld.service: start operation
timed out. Terminating.
Mar 17 18:57:19 slurm systemd[1]: slurmctld.service: Failed with result
'timeout'.
My slurm.conf file lists custom PID file locations for slurmctld and slurmd:
/etc/slurm-llnl/slurm.conf
SlurmctldPidFile=/run/slurm-llnl/slurmctld.pid
SlurmdPidFile=/run/slurm-llnl/slurmd.pid
Starting the slurmctld executable by hand works fine:
/usr/sbin/slurmctld &
pgrep slurmctld
2819
[1]+ Done /usr/sbin/slurmctld
pgrep slurmctld
2819
squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
sinfo -lNe
Wed Mar 17 19:01:45 2021
NODELIST  NODES  PARTITION  STATE     CPUS  S:C:T  MEMORY  TMP_DISK  WEIGHT  AVAIL_FE  REASON
ekgen1    1      cluster*   unknown*  16    2:8:1  480000  0         1       (null)    none
ekgen2    1      cluster*   down*     16    2:8:1  250000  0         1       (null)    Not responding
ekgen3    1      debian     unknown*  16    2:8:1  250000  0         1       (null)    none
ekgen4    1      cluster*   unknown*  16    2:8:1  250000  0         1       (null)    none
ekgen5    1      cluster*   unknown*  16    2:8:1  250000  0         1       (null)    none
ekgen6    1      debian     unknown*  16    2:8:1  250000  0         1       (null)    none
ekgen7    1      cluster*   unknown*  16    2:8:1  250000  0         1       (null)    none
ekgen8    1      debian     down*     16    2:8:1  250000  0         1       (null)    Not responding
ekgen9    1      cluster*   unknown*  16    2:8:1  192000  0         1       (null)    none
I then tried modifying /lib/systemd/system/slurmd.service. After backing it up,
cp /lib/systemd/system/slurmd.service /lib/systemd/system/slurmd.service.orig
I changed
PIDFile=/run/slurmd.pid
to
PIDFile=/run/slurm-llnl/slurmd.pid
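Files under /lib/systemd/system belong to the package and may be overwritten on upgrade; a drop-in under /etc/systemd/system is the usual alternative. The sketch below writes one for slurmctld.service, which is the unit being started here (the edit above touched slurmd.service). The helper name write_pidfile_dropin and the file name override.conf are illustrative choices, and the PID path is the one from the slurm.conf lines quoted earlier.

```shell
#!/bin/sh
# write_pidfile_dropin UNIT PIDFILE [BASE]
# Writes BASE/UNIT.d/override.conf with a PIDFile= setting for UNIT.
# BASE defaults to /etc/systemd/system; pass another directory for a dry run.
write_pidfile_dropin() {
    unit=$1
    pidfile=$2
    base=${3:-/etc/systemd/system}
    mkdir -p "$base/$unit.d" || return 1
    printf '[Service]\nPIDFile=%s\n' "$pidfile" > "$base/$unit.d/override.conf"
}

# Example (as root; systemd must re-read its units afterwards):
#   write_pidfile_dropin slurmctld.service /run/slurm-llnl/slurmctld.pid
#   systemctl daemon-reload
#   systemctl restart slurmctld
```

Whatever path ends up in PIDFile= has to match SlurmctldPidFile in slurm.conf exactly, otherwise systemd keeps waiting for a file the daemon never writes.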
systemctl start slurmctld &
[1] 1869
pgrep slurm
1875
squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
After about 60 seconds:
Job for slurmctld.service failed because a timeout was exceeded.
See "systemctl status slurmctld.service" and "journalctl -xe" for details
-- Subject: A start job for unit packagekit.service has finished successfully
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- A start job for unit packagekit.service has finished successfully.
--
-- The job identifier is 586.
Mar 17 18:28:08 slurm systemd[1]: slurmctld.service: start operation
timed out. Terminating.
Mar 17 18:28:08 slurm systemd[1]: slurmctld.service: Failed with result
'timeout'.
-- Subject: Unit failed
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- The unit slurmctld.service has entered the 'failed' state with result
'timeout'.
Mar 17 18:28:08 slurm systemd[1]: Failed to start Slurm controller daemon.
-- Subject: A start job for unit slurmctld.service has failed
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- A start job for unit slurmctld.service has finished with a failure.
--
-- The job identifier is 511 and the job result is failed.
Mar 17 18:31:18 slurm systemd[1]: Starting Slurm controller daemon...
-- Subject: A start job for unit slurmctld.service has begun execution
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- A start job for unit slurmctld.service has begun execution.
--
-- The job identifier is 662.
Mar 17 18:31:18 slurm systemd[1]: slurmctld.service: Can't open PID file
/run/slurm-lnll/slurmctld.pid (yet?) after start: Operation not permitted
Mar 17 18:32:48 slurm systemd[1]: slurmctld.service: start operation
timed out. Terminating.
Mar 17 18:32:48 slurm systemd[1]: slurmctld.service: Failed with result
'timeout'.
-- Subject: Unit failed
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
mkdir /run/slurm-lnll/
chown slurm: /run/slurm-lnll/
ls -lthrd /run/slurm-lnll/
drwxr-xr-x 2 slurm slurm 40 Mar 17 18:34 /run/slurm-lnll/
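/run is a tmpfs on Ubuntu 20.04, so a directory created there by hand will be gone after a reboot. One way to have it recreated at boot is a tmpfiles.d entry; the sketch below only generates the line. The function name tmpfiles_line and the file name slurm-piddir.conf are hypothetical, and the example uses /run/slurm-llnl (the spelling from slurm.conf) rather than the path created above, which is an assumption about the intended directory.

```shell
#!/bin/sh
# tmpfiles_line PATH MODE USER GROUP
# Prints a tmpfiles.d "d" (create directory) entry in the field order
#   Type Path Mode User Group Age
# with "-" as the Age field, meaning no automatic cleanup.
tmpfiles_line() {
    printf 'd %s %s %s %s -\n' "$1" "$2" "$3" "$4"
}

# Example (as root; the file name is a placeholder):
#   tmpfiles_line /run/slurm-llnl 0755 slurm slurm > /etc/tmpfiles.d/slurm-piddir.conf
#   systemd-tmpfiles --create /etc/tmpfiles.d/slurm-piddir.conf
```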
The PID file is still not created:
ls -lthr /run/slurm-lnll/
total 0
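One thing that may be worth ruling out: the journal lines above mention /run/slurm-lnll/, while the slurm.conf excerpt uses /run/slurm-llnl/. A small comparison like the following would surface such a mismatch. It is only a sketch; get_value and compare_pidfiles are hypothetical helper names, and it assumes plain Key=Value lines with exact-case keys and no inline comments.

```shell
#!/bin/sh
# get_value KEY FILE: print the value of the last "KEY=value" line in FILE.
get_value() {
    sed -n "s/^[[:space:]]*$1[[:space:]]*=[[:space:]]*//p" "$2" | tail -n 1
}

# compare_pidfiles SLURM_CONF UNIT_FILE
# Prints both settings and succeeds only if SlurmctldPidFile (slurm.conf)
# is non-empty and equal to PIDFile (unit file).
compare_pidfiles() {
    conf_pid=$(get_value SlurmctldPidFile "$1")
    unit_pid=$(get_value PIDFile "$2")
    echo "slurm.conf: $conf_pid"
    echo "unit file:  $unit_pid"
    [ -n "$conf_pid" ] && [ "$conf_pid" = "$unit_pid" ]
}

# Example:
#   compare_pidfiles /etc/slurm-llnl/slurm.conf /lib/systemd/system/slurmctld.service
```

If drop-ins are in play, the unit file on disk may not tell the whole story; systemctl show -p PIDFile slurmctld reports the value systemd is actually using.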
A work-around, writing the PID manually to the PID file, does work:
systemctl start slurmctld & sleep 10; echo `pgrep slurmctld` > \
  /run/slurm-lnll/slurmctld.pid && chown slurm: \
  /run/slurm-lnll/slurmctld.pid && cat /run/slurm-lnll/slurmctld.pid
systemctl status still reports the PID file problem:
systemctl status slurmctld
● slurmctld.service - Slurm controller daemon
Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled;
vendor preset: enabled)
Active: active (running) since Wed 2021-03-17 18:37:28 CET; 1min 4s ago
Docs: man:slurmctld(8)
Process: 2272 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS
(code=exited, status=0/SUCCESS)
Main PID: 2287 (slurmctld)
Tasks: 7
Memory: 2.3M
CGroup: /system.slice/slurmctld.service
└─2287 /usr/sbin/slurmctld
Mar 17 18:37:18 slurm systemd[1]: Starting Slurm controller daemon...
Mar 17 18:37:18 slurm systemd[1]: slurmctld.service: Can't open PID file
/run/slurm-lnll/slurmctld.pid (yet?) after start: Operation not permitted
Mar 17 18:37:28 slurm systemd[1]: Started Slurm controller daemon.
But the slurmctld process doesn't crash anymore. Stopping the service
does work:
systemctl stop slurmctld.service
systemctl status slurmctld
● slurmctld.service - Slurm controller daemon
Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled;
vendor preset: enabled)
Active: inactive (dead) since Wed 2021-03-17 18:50:47 CET; 1s ago
Docs: man:slurmctld(8)
Process: 2272 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS
(code=exited, status=0/SUCCESS)
Main PID: 2287 (code=exited, status=0/SUCCESS)
Mar 17 18:37:18 slurm systemd[1]: Starting Slurm controller daemon...
Mar 17 18:37:18 slurm systemd[1]: slurmctld.service: Can't open PID file
/run/slurm-lnll/slurmctld.pid (yet?) after start: Operation not permitted
Mar 17 18:37:28 slurm systemd[1]: Started Slurm controller daemon.
Mar 17 18:50:47 slurm systemd[1]: Stopping Slurm controller daemon...
Mar 17 18:50:47 slurm systemd[1]: slurmctld.service: Succeeded.
Mar 17 18:50:47 slurm systemd[1]: Stopped Slurm controller daemon.
I am a little astonished that slurmctld shows this strange behaviour
when installed from the default package through the package manager.
The base system is an Ubuntu 20.04 server installation, to which I made
no modifications apart from installing the SLURM-wlm packages and
importing my existing configuration and munge.key.
Best wishes,
Sven Duscha