After installing SLURM in Ubuntu and before starting the services, I do:

mkdir -p /var/spool/slurmd
mkdir -p /var/lib/slurm-llnl
mkdir -p /var/lib/slurm-llnl/slurmd
mkdir -p /var/lib/slurm-llnl/slurmctld
mkdir -p /var/run/slurm-llnl    (You need to change this to /run/slurm-llnl as your location for SlurmdPidFile and SlurmctldPidFile)
mkdir -p /var/log/slurm-llnl
chmod -R 755 /var/spool/slurmd
chmod -R 755 /var/lib/slurm-llnl/
chmod -R 755 /var/run/slurm-llnl/    (Also here)
chmod -R 755 /var/log/slurm-llnl/
chown -R slurm:slurm /var/spool/slurmd
chown -R slurm:slurm /var/lib/slurm-llnl/
chown -R slurm:slurm /var/run/slurm-llnl/    (And here)
chown -R slurm:slurm /var/log/slurm-llnl/

Hope that clarifies something. My first SLURM installations failed because of missing directories and wrong permissions.

Best!

On Wed, 17 Mar 2021 at 11:56, Brian Andrus (<toomuc...@gmail.com>) wrote:

> I am guessing you aren't overly familiar with Linux/systemd since you
> have the '&' at the end of your start command.
>
> Be that as it may, you can see it is a permissions issue. Check
> permissions on /run and ensure the slurmctld user is able to write there.
>
> You can either change the slurmctld user to one that can write there or
> change the permissions on the directory to allow the slurmctld user
> write access.
>
> Brian Andrus
>
>
> On 3/17/2021 11:16 AM, Sven Duscha wrote:
> > Hi,
> >
> > I experience with SLURM slurmctld an error on Ubuntu20.04, when starting
> > the service (through systemctl):
> >
> > I installed munge and SLURM version 19.05.5-1 through the package
> > manager from the default repository:
> >
> > apt-get install munge slurm-client slurm-wlm slurm-wlm-doc slurmctld slurmd
> >
> > systemctl start slurmctld &
> > [1] 2735
> > 18:55 [root@slurm ~]# systemctl status slurmctld
> > ● slurmctld.service - Slurm controller daemon
> >      Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled; vendor preset: enabled)
> >      Active: activating (start) since Wed 2021-03-17 18:55:49 CET; 5s ago
> >        Docs: man:slurmctld(8)
> >     Process: 2737 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
> >       Tasks: 12
> >      Memory: 2.5M
> >      CGroup: /system.slice/slurmctld.service
> >              └─2759 /usr/sbin/slurmctld
> >
> > Mar 17 18:55:49 slurm systemd[1]: Starting Slurm controller daemon...
> > Mar 17 18:55:49 slurm systemd[1]: slurmctld.service: Can't open PID file /run/slurmctld.pid (yet?) after start: Operation not permitted
> >
> > After about 60 seconds slurmctld terminates:
> >
> > -- A stop job for unit slurmctld.service has finished.
> > --
> > -- The job identifier is 1043 and the job result is done.
> > Mar 17 18:55:49 slurm systemd[1]: Starting Slurm controller daemon...
> > -- Subject: A start job for unit slurmctld.service has begun execution
> > -- Defined-By: systemd
> > -- Support: http://www.ubuntu.com/support
> > --
> > -- A start job for unit slurmctld.service has begun execution.
> > --
> > -- The job identifier is 1044.
> > Mar 17 18:55:49 slurm systemd[1]: slurmctld.service: Can't open PID file /run/slurmctld.pid (yet?) after start: Operation not permitted
> > Mar 17 18:57:19 slurm systemd[1]: slurmctld.service: start operation timed out. Terminating.
> > Mar 17 18:57:19 slurm systemd[1]: slurmctld.service: Failed with result 'timeout'.
> >
> > My slurm.conf file lists custom PID file locations for slurmctld and slurmd:
> > /etc/slurm-llnl/slurm.conf
> >
> > SlurmctldPidFile=/run/slurm-llnl/slurmctld.pid
> > SlurmdPidFile=/run/slurm-llnl/slurmd.pid
> >
> > Starting the slurmctld executable by hand works fine:
> > /usr/sbin/slurmctld &
> >
> > pgrep slurmctld
> > 2819
> > [1]+ Done /usr/sbin/slurmctld
> > pgrep slurmctld
> > 2819
> > squeue
> > JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
> > sinfo -lNe
> > Wed Mar 17 19:01:45 2021
> > NODELIST  NODES  PARTITION  STATE     CPUS  S:C:T  MEMORY  TMP_DISK  WEIGHT  AVAIL_FE  REASON
> > ekgen1    1      cluster*   unknown*  16    2:8:1  480000  0         1       (null)    none
> > ekgen2    1      cluster*   down*     16    2:8:1  250000  0         1       (null)    Not responding
> > ekgen3    1      debian     unknown*  16    2:8:1  250000  0         1       (null)    none
> > ekgen4    1      cluster*   unknown*  16    2:8:1  250000  0         1       (null)    none
> > ekgen5    1      cluster*   unknown*  16    2:8:1  250000  0         1       (null)    none
> > ekgen6    1      debian     unknown*  16    2:8:1  250000  0         1       (null)    none
> > ekgen7    1      cluster*   unknown*  16    2:8:1  250000  0         1       (null)    none
> > ekgen8    1      debian     down*     16    2:8:1  250000  0         1       (null)    Not responding
> > ekgen9    1      cluster*   unknown*  16    2:8:1  192000  0         1       (null)    none
> >
> > I tried then to modify /lib/systemd/system/slurmd.service
> >
> > cp /lib/systemd/system/slurmd.service /lib/systemd/system/slurmd.service.orig
> >
> > changed
> > PIDFile=/run/slurmd.pid
> > to
> > PIDFile=/run/slurm-llnl/slurmd.pid
> >
> > systemctl start slurmctld &
> > [1] 1869
> > pgrep slurm
> > 1875
> > squeue
> > JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
> >
> > after ca. 60 seconds:
> >
> > Job for slurmctld.service failed because a timeout was exceeded.
> > See "systemctl status slurmctld.service" and "journalctl -xe" for details
> >
> > - Subject: A start job for unit packagekit.service has finished successfully
> > -- Defined-By: systemd
> > -- Support: http://www.ubuntu.com/support
> > --
> > -- A start job for unit packagekit.service has finished successfully.
> > --
> > -- The job identifier is 586.
> > Mar 17 18:28:08 slurm systemd[1]: slurmctld.service: start operation timed out. Terminating.
> > Mar 17 18:28:08 slurm systemd[1]: slurmctld.service: Failed with result 'timeout'.
> > -- Subject: Unit failed
> > -- Defined-By: systemd
> > -- Support: http://www.ubuntu.com/support
> > --
> > -- The unit slurmctld.service has entered the 'failed' state with result 'timeout'.
> > Mar 17 18:28:08 slurm systemd[1]: Failed to start Slurm controller daemon.
> > -- Subject: A start job for unit slurmctld.service has failed
> > -- Defined-By: systemd
> > -- Support: http://www.ubuntu.com/support
> > --
> > -- A start job for unit slurmctld.service has finished with a failure.
> > --
> > -- The job identifier is 511 and the job result is failed.
> > Mar 17 18:31:18 slurm systemd[1]: Starting Slurm controller daemon...
> > -- Subject: A start job for unit slurmctld.service has begun execution
> > -- Defined-By: systemd
> > -- Support: http://www.ubuntu.com/support
> > --
> > -- A start job for unit slurmctld.service has begun execution.
> > --
> > -- The job identifier is 662.
> > Mar 17 18:31:18 slurm systemd[1]: slurmctld.service: Can't open PID file /run/slurm-lnll/slurmctld.pid (yet?) after start: Operation not permitted
> > Mar 17 18:32:48 slurm systemd[1]: slurmctld.service: start operation timed out. Terminating.
> > Mar 17 18:32:48 slurm systemd[1]: slurmctld.service: Failed with result 'timeout'.
> > -- Subject: Unit failed
> > -- Defined-By: systemd
> > -- Support: http://www.ubuntu.com/support
> >
> > mkdir /run/slurm-lnll/
> > chown slurm: /run/slurm-lnll/
> >
> > ls -lthrd /run/slurm-lnll/
> > drwxr-xr-x 2 slurm slurm 40 Mar 17 18:34 /run/slurm-lnll/
> >
> > It doesn't create the PID file
> >
> > ls -lthr /run/slurm-lnll/
> > total 0
> >
> > A work-around, writing the PID manually to the PID file, does work:
> >
> > systemctl start slurmctld & sleep 10; echo `pgrep slurmctld` > /run/slurm-lnll/slurmctld.pid && chown slurm: /run/slurm-lnll/slurmctld.pid && cat /run/slurm-lnll/slurmctld.pid
> >
> > Still status problem reported:
> >
> > systemctl status slurmctld
> > ● slurmctld.service - Slurm controller daemon
> >      Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled; vendor preset: enabled)
> >      Active: active (running) since Wed 2021-03-17 18:37:28 CET; 1min 4s ago
> >        Docs: man:slurmctld(8)
> >     Process: 2272 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
> >    Main PID: 2287 (slurmctld)
> >       Tasks: 7
> >      Memory: 2.3M
> >      CGroup: /system.slice/slurmctld.service
> >              └─2287 /usr/sbin/slurmctld
> >
> > Mar 17 18:37:18 slurm systemd[1]: Starting Slurm controller daemon...
> > Mar 17 18:37:18 slurm systemd[1]: slurmctld.service: Can't open PID file /run/slurm-lnll/slurmctld.pid (yet?) after start: Operation not permitted
> > Mar 17 18:37:28 slurm systemd[1]: Started Slurm controller daemon.
> >
> > But the slurmctld process doesn't crash anymore.
> > Stopping the service does work:
> >
> > systemctl stop slurmctld.service
> > systemctl status slurmctld
> > ● slurmctld.service - Slurm controller daemon
> >      Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled; vendor preset: enabled)
> >      Active: inactive (dead) since Wed 2021-03-17 18:50:47 CET; 1s ago
> >        Docs: man:slurmctld(8)
> >     Process: 2272 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
> >    Main PID: 2287 (code=exited, status=0/SUCCESS)
> >
> > Mar 17 18:37:18 slurm systemd[1]: Starting Slurm controller daemon...
> > Mar 17 18:37:18 slurm systemd[1]: slurmctld.service: Can't open PID file /run/slurm-lnll/slurmctld.pid (yet?) after start: Operation not permitted
> > Mar 17 18:37:28 slurm systemd[1]: Started Slurm controller daemon.
> > Mar 17 18:50:47 slurm systemd[1]: Stopping Slurm controller daemon...
> > Mar 17 18:50:47 slurm systemd[1]: slurmctld.service: Succeeded.
> > Mar 17 18:50:47 slurm systemd[1]: Stopped Slurm controller daemon.
> >
> > I am a little astonished that the default package shows this strange
> > behaviour regarding slurmctld installed through the package manager.
> >
> > The base installation is Ubuntu 20.04 server installation, where I did
> > no modifications apart from installing the SLURM-wlm packages and
> > importing my existing configuration and munge.key.
> >
> > Best wishes,
> >
> > Sven Duscha
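
A note on the PID file paths quoted above: slurm.conf sets SlurmctldPidFile=/run/slurm-llnl/slurmctld.pid, while the later systemd messages mention /run/slurm-lnll/slurmctld.pid (llnl vs. lnll). If the path in the unit file differs from the one in slurm.conf, systemd may wait for a PID file that slurmctld never writes at that location, and the start would then time out. Below is a minimal sketch for comparing the two settings. It assumes plain Key=Value lines with the key spelled exactly as shown; the helper names conf_value and check_pidfile are made up for this example and are not part of SLURM or systemd.

```shell
#!/bin/sh
# Sketch only: conf_value and check_pidfile are hypothetical helper names.

# Print the value of the last "KEY=value" line in FILE.
# Commented-out lines (starting with '#') are not matched.
# $1 = key (matched case-sensitively here), $2 = file
conf_value() {
    sed -n "s/^[[:space:]]*$1[[:space:]]*=[[:space:]]*\(.*[^[:space:]]\)[[:space:]]*\$/\1/p" "$2" | tail -n 1
}

# Compare SlurmctldPidFile in slurm.conf with PIDFile in the unit file.
# $1 = path to slurm.conf, $2 = path to slurmctld.service
# Returns 0 and prints "match: ..." when both are set and identical,
# otherwise prints "mismatch: ..." and returns 1.
check_pidfile() {
    conf_pid=$(conf_value SlurmctldPidFile "$1")
    unit_pid=$(conf_value PIDFile "$2")
    if [ -n "$conf_pid" ] && [ "$conf_pid" = "$unit_pid" ]; then
        echo "match: $conf_pid"
    else
        echo "mismatch: slurm.conf=$conf_pid unit=$unit_pid"
        return 1
    fi
}
```

With the paths from this thread, the call would be: check_pidfile /etc/slurm-llnl/slurm.conf /lib/systemd/system/slurmctld.service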