I would also encourage you to use defaults in the slurm.conf (matching what's shipped in the Ubuntu packages). However, here is what I've done to use non-Ubuntu-package paths for the PID files.
Create an override in /etc/systemd/system/slurmd.service.d/override.conf with something like:

node32[~]: cat /etc/systemd/system/slurmd.service.d/override.conf
[Service]
PIDFile=/var/run/slurm-llnl/slurmd.pid
RuntimeDirectory=slurm-llnl
RuntimeDirectoryMode=0775

Replace the daemon name as necessary. The RuntimeDirectory= setting is needed because /run and /var/run are virtual file systems managed by systemd. Creating that directory "by hand" has unpredictable results.

HTH - Michael

On Thu, Mar 18, 2021 at 4:52 AM Sven Duscha <sven.dus...@tum.de> wrote:

> Hi,
>
> thanks for all the responses.
>
> On 18.03.21 11:29, Stefan Staeglich wrote:
> > I think it makes more sense to adjust the config file
> > /etc/slurm-llnl/slurm.conf
> > and not the systemd units:
> > SlurmctldPidFile=/run/slurmctld.pid
> > SlurmdPidFile=/run/slurmd.pid
>
>
> That was of course my first approach. I had used the directory
> /run/slurm-lnll/ on my CentOS 7 installations, from which I copied the
> slurm.conf file over.
>
> It turned out that those directories I defined there weren't used. The
> error message suggested that slurmctld still tried to write to
> /run/slurmctld.pid.
>
> Changing the systemd file was my last resort. And as mentioned I don't
> expect to have to do that much fiddling with a (relatively old, 19.05-5)
> package manager version. It seems "snap" provides a more current
> version 20.02.1:
>
>
> snap install slurm # version 20.02.1, or
> apt install slurm-client # version 19.05.5-1
>
>
> The underlying distribution installation also hasn't been modified by
> me. I want to use Ubuntu 20.04 as my future cluster OS, and the
> kvm-virtualized SLURM controller was the first I tried.
>
>
> Brian Andrus suggested:
>
> On 17.03.21 21:32, Brian Andrus wrote:
> > That is looking like your /run folder does not have world execute
> > permissions, making it impossible for anything to access sub-directories.
>
> But I can write as user "sven" (I didn't set up the LDAP connection,
> yet) in a subdirectory of /run/slurm-lnll, if it belongs to user "sven".
>
>
> Furthermore, I used the option "SlurmUser=slurm" in my slurm.conf file,
> because it is good practice to not use root. Changing this to "root",
> which should give universal access to all directories, doesn't make a
> difference:
>
> #SlurmUser=slurm
> SlurmdUser=root
>
>
> My initial response, that /var/run/slurm-lnll/slurmctld.pid worked for me,
> was also premature. It kind of works for the first start after a reboot
> with
>
> systemctl start slurmctld
>
> and
>
> systemctl stop slurmctld
>
> works, but then lingers around in the timeout. During that time
> slurmctld still runs, I see the process, and can use squeue, sinfo etc.
>
> After the pid file writing timeout it shows the service to be
> terminated. This time not due to the inability of writing the
> slurmctld.pid file, but instead suggesting my modification to the legacy
> location /var/run - which itself is only a reference to /run:
>
> Mar 18 12:30:43 slurm systemd[1]: Reloading.
> Mar 18 12:30:43 slurm systemd[1]: /lib/systemd/system/dbus.socket:5: ListenStream= references a path below legacy directory /var/run/, updating /var/>
> Mar 18 12:30:43 slurm systemd[1]: /lib/systemd/system/slurmd.service:12: PIDFile= references a path below legacy directory /var/run/, updating /var/r>
> Mar 18 12:31:59 slurm systemd[1]: slurmctld.service: start operation timed out. Terminating.
> Mar 18 12:31:59 slurm systemd[1]: slurmctld.service: Failed with result 'timeout'.
>
>
> time systemctl start slurmctld
> Job for slurmctld.service failed because a timeout was exceeded.
> See "systemctl status slurmctld.service" and "journalctl -xe" for details.
>
> real 1m1.314s
> user 0m0.003s
> sys 0m0.002s
>
> -- A session with the ID 1 has been terminated.
> Mar 18 12:30:43 slurm systemd[1]: Reloading.
> Mar 18 12:30:43 slurm systemd[1]: /lib/systemd/system/dbus.socket:5: ListenStream= references a path below legacy directory /var/run/, updating /var/>
> Mar 18 12:30:43 slurm systemd[1]: /lib/systemd/system/slurmd.service:12: PIDFile= references a path below legacy directory /var/run/, updating /var/r>
> Mar 18 12:31:59 slurm systemd[1]: slurmctld.service: start operation timed out. Terminating.
> Mar 18 12:31:59 slurm systemd[1]: slurmctld.service: Failed with result 'timeout'.
>
>
> The initial "&" I put after the systemctl, because I wanted to get to my
> prompt to investigate the problem. Normal behaviour, as I expect it,
> would be a starting time of 1-2 seconds.
>
>
> I am back to my work-around:
>
> systemctl start slurmctld & sleep 10; echo `pgrep slurmctld` > /run/slurm-lnll/slurmctld.pid && chown slurm: /run/slurm-lnll/slurmctld.pid && cat /run/slurm-lnll/slurmctld.pid
>
>
> My configuration file is read, though, as I can check with scontrol:
>
> scontrol show config | grep run
> SlurmdPidFile = /var/run/slurm-llnl/slurmd.pid
> SlurmctldPidFile = /var/run/slurm-llnl/slurmctld.pid
>
>
> So, all of this hassle shouldn't occur, my fiddling with systemd should
> be entirely unnecessary.
>
> Mar 18 12:37:13 slurm systemd[1]: slurmctld.service: Can't open PID file /run/slurm-lnll/slurmctld.pid (yet?) after start: Operation not permitted
> Mar 18 12:38:43 slurm systemd[1]: slurmctld.service: start operation timed out. Terminating.
> Mar 18 12:38:43 slurm systemd[1]: slurmctld.service: Failed with result 'timeout'.
>
>
> Unmodified systemd file:
>
> [Unit]
> Description=Slurm controller daemon
> After=network.target munge.service
> ConditionPathExists=/etc/slurm-llnl/slurm.conf
> Documentation=man:slurmctld(8)
>
> [Service]
> Type=forking
> EnvironmentFile=-/etc/default/slurmctld
> ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS
> ExecReload=/bin/kill -HUP $MAINPID
> PIDFile=/run/slurm-lnll/slurmctld.pid
> LimitNOFILE=65536
> TasksMax=infinity
>
> [Install]
> WantedBy=multi-user.target
> ~
>
>
> I do know of some file permission issues that I encountered on CentOS 7,
> but by all apparent means, i.e. checking the permissions, it should work
> with those permissions in the subdirectory
>
> ls -lthrd /run/slurm-lnll/
> drwxrwxr-x 2 root slurm 40 Mar 18 12:31 /run/slurm-lnll/
>
>
> But this suggests that it ignores the setting in the slurm.conf file:
>
> SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
> SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
>
>
> -- The job identifier is 2259.
> Mar 18 12:41:34 slurm systemd[1]: slurmctld.service: Can't open PID file /run/slurm-lnll/slurmctld.pid (yet?) after start: Operation not permitted
> Mar 18 12:43:04 slurm systemd[1]: slurmctld.service: start operation timed out. Terminating.
> Mar 18 12:43:04 slurm systemd[1]: slurmctld.service: Failed with result 'timeout'.
>
>
> Though scontrol show config claims otherwise:
>
> scontrol show config | grep run
> SlurmdPidFile = /var/run/slurm-llnl/slurmd.pid
> SlurmctldPidFile = /var/run/slurm-llnl/slurmctld.pid
> SrunEpilog = (null)
> SrunPortRange = 0-0
> SrunProlog = (null)
>
>
> I would attribute it to my fault, but I started yesterday with a
> "vanilla" installation of Ubuntu 20.04 server, and the purpose of this VM
> is only to run slurmctld.
>
>
> This "should" occur to many more people, or I am missing something
> obvious.
> If it were down to permissions, making the directory /run/slurm-lnll
> world-writable:
>
> ls -lthrd /run/slurm-lnll/
> drwxrwxrwx 2 root slurm 40 Mar 18 12:31 /run/slurm-lnll/
>
> should "fix" the problem. I could live with that, even though I try to
> adhere to strict permission management.
>
> That also doesn't work:
>
> Mar 18 12:46:33 slurm systemd[1]: slurmctld.service: Can't open PID file /run/slurm-lnll/slurmctld.pid (yet?) after start: Operation not permitted
> Mar 18 12:46:38 slurm systemd[1]: Reloading.
>
>
> So, I am going in circles here.
>
>
> Best wishes,
>
> Sven
>
>
> --
> Sven Duscha
> Deutsches Herzzentrum München
> Technische Universität München
> Lazarettstraße 36
> 80636 München
> +49 89 1218 2602
>
>
>
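
Since the thread keeps coming back to whether the PIDFile= that systemd waits for and the SlurmctldPidFile that slurmctld writes name the same file, here is a rough sketch of a side-by-side check. In the output quoted above the two do read differently (slurm-lnll in the unit file, slurm-llnl in the scontrol output). The function names are made up for illustration, the file locations are passed in as arguments because they are assumptions about your install, and the /var/run to /run rewrite only mimics what the systemd log messages above describe.

```shell
#!/bin/sh
# Sketch only: compare the PID file path in a systemd unit (or drop-in)
# with the one configured in slurm.conf. All function names are hypothetical.

# Print the last PIDFile= value found in a unit file or drop-in.
unit_pidfile() {
    sed -n 's/^PIDFile=//p' "$1" | tail -n 1
}

# Print the value of a key such as SlurmctldPidFile from a slurm.conf.
# $1 = path to slurm.conf, $2 = key name.
conf_pidfile() {
    sed -n "s/^$2[[:space:]]*=[[:space:]]*//p" "$1" | tail -n 1
}

# systemd rewrites /var/run/... to /run/..., so do the same before comparing.
normalize_run() {
    printf '%s\n' "$1" | sed 's|^/var/run/|/run/|'
}

# Print both paths; return 0 if they name the same file, 1 otherwise.
# $1 = unit file, $2 = slurm.conf, $3 = key name in slurm.conf.
pidfiles_match() {
    u=$(normalize_run "$(unit_pidfile "$1")")
    c=$(normalize_run "$(conf_pidfile "$2" "$3")")
    echo "unit: $u"
    echo "conf: $c"
    [ -n "$u" ] && [ "$u" = "$c" ]
}
```

With the functions loaded, something like `pidfiles_match /lib/systemd/system/slurmctld.service /etc/slurm-llnl/slurm.conf SlurmctldPidFile` would print both paths and return non-zero if they differ. The unit file location here is a guess based on the slurmd.service path in the log above. If an override.conf is in play, pass that file instead, since its PIDFile= takes precedence.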