Yes, you have both daemons, installed with the slurm rpm. The slurmd (on all nodes) communicates with slurmctld (which runs on the main master node and, optionally, on a backup node).
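If it helps, here is a small sketch for checking which Slurm service files a node actually has. The function name `slurm_units` is made up for illustration, and the default directory is an assumption based on where the rpm placed the files on my system; adjust it if yours differ.

```shell
# Hypothetical helper: report which Slurm unit files exist in a directory.
# The default path is an assumption (it is where the rpm put them on my system).
slurm_units() {
    dir="${1:-/usr/lib/systemd/system}"
    for unit in slurmctld slurmd slurmdbd; do
        if [ -f "$dir/$unit.service" ]; then
            echo "$unit: present"
        else
            echo "$unit: missing"
        fi
    done
}

# Check the default location; pass another directory as the first argument if needed.
slurm_units
```

On a compute node you would normally only start slurmd, even if the slurmctld service file is also present.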
You do not need to run slurmd as the slurm user. Use `systemctl enable slurmctld` (and the same for slurmd) followed by `systemctl start slurmctld`. If you change the configuration, run `sudo scontrol reconfigure`, and use restart instead of start only if it asks for it. If you run `slurmctld -Dvvvv` and `slurmd -Dvvvv` as root, you'll see debug output that helps to spot further problems with the configuration. The slurmd needs slurmctld running, or it will output "error: Unable to register: Unable to contact slurm controller (connect failure)".

You should find the services here:

-rw-r--r-- 1 root root 339 may 30 20:18 /usr/lib/systemd/system/slurmctld.service
-rw-r--r-- 1 root root 342 may 30 20:18 /usr/lib/systemd/system/slurmdbd.service
-rw-r--r-- 1 root root 398 may 30 20:18 /usr/lib/systemd/system/slurmd.service

Feel free to ask for more information.

Best regards

On Tue, 2 Jun 2020 at 11:12, Ferran Planas Padros (<ferran.pad...@su.se>) wrote:

> Hi Ole,
>
> Thanks for your answer and your time. I'd appreciate it if you, or someone
> else, could take a final look at my case.
>
> After your suggestions and comments, I have redone the whole installation
> for Munge and Slurm. I uninstalled and removed all previous rpms and
> restarted from scratch. Munge works with no problem; however, the same is
> not true for Slurm (for which I have used the instructions given in
> the link you attached).
>
> - If I run /usr/bin/slurmd -D vvvvv as root user, I get the verbose output until
> the line 'slurmd: debug2: No acct_gather.conf file
> (/etc/slurm/acct_gather.conf)', where the verbose output stops. After I press
> Ctrl+C, I get
>
> slurmd: all threads complete
> slurmd: Consumable Resources (CR) Node Selection plugin shutting down ...
> slurmd: Munge cryptographic signature plugin unloaded
> slurmd: Slurmd shutdown completing
>
> - After that, if I run 'systemctl start slurmd' and 'systemctl status
> slurmd', also as root user, I get:
>
> *●* slurmd.service - Slurm node daemon
> Loaded: loaded (/etc/systemd/system/slurmd.service; enabled; vendor preset: disabled)
> Active: *active (running)* since Tue 2020-06-02 16:53:51 CEST; 33s ago
> Process: 2750 ExecStart=/usr/sbin/slurmd -d /usr/sbin/slurmstepd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
> Main PID: 2752 (slurmd)
> CGroup: /system.slice/slurmd.service
> └─2752 /usr/sbin/slurmd -d /usr/sbin/slurmstepd
>
> Jun 02 16:53:51 roos21.organ.su.se systemd[1]: Starting Slurm node daemon...
> Jun 02 16:53:51 roos21.organ.su.se systemd[1]: Can't open PID file /var/run/slurm/slurmd.pid (yet?) after start: No such file or directory
> Jun 02 16:53:51 roos21.organ.su.se systemd[1]: Started Slurm node daemon.
>
> - Next, I kill the slurmd process, and I run, as slurm user, 'systemctl
> start slurm'. Which does not work and returns the following in the
> journalctl -xe:
>
> Jun 02 16:56:01 roos21.organ.su.se systemd[1]: Starting LSB: slurm daemon management...
> -- Subject: Unit slurm.service has begun start-up
> -- Defined-By: systemd
> -- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
> --
> -- Unit slurm.service has begun starting up.
> Jun 02 16:56:01 roos21.organ.su.se slurm[2805]: starting slurmd: [ OK ]
> Jun 02 16:56:01 roos21.organ.su.se systemd[1]: Can't open PID file /var/run/slurmctld.pid (yet?) after start: No such file or directory
> Jun 02 16:56:37 roos21.organ.su.se polkitd[1316]: *Unregistered Authentication Agent for unix-process:2792:334647 (system bus name :1.46, object path /org/freedesktop*
> Jun 02 16:56:38 roos21.organ.su.se sudo[2790]: pam_unix(sudo:session): session closed for user slurm
>
> Something that I don't really understand because I have not installed
> slurmctld. The slurmctld.service file does not even exist.
>
> Any idea?
>
> Many thanks,
>
> Ferran
>
> ------------------------------
> *From:* slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of
> Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk>
> *Sent:* Tuesday, June 2, 2020 12:03:27 PM
> *To:* Slurm User Community List
> *Subject:* Re: [slurm-users] Problem with permisions. CentOS 7.8
>
> Hi Ferran,
>
> Please install Slurm software in the standard way, see
> https://wiki.fysik.dtu.dk/niflheim/Slurm_installation
>
> It seems that you have some unusual way to manage your Linux systems. In
> Stockholm and Sweden there are many Slurm experts at the HPC centers which
> might be able to help you more directly.
>
> Best regards,
> Ole
>
> On 6/2/20 11:58 AM, Ferran Planas Padros wrote:
> > I did a fresh installation with the EPEL repo, and installing munge from
> > it and it worked. To have the slurm user for munge was definitely a
> > problem, but that is the set up we have on the CentOS 6. Now I've learnt
> > my lesson for future installations, thanks to everyone!
> >
> > Now, I have a follow up question, if you don't mind.
> > I am now trying to run slurm, and it crashes:
> >
> > [root@roos21 ~]# systemctl status slurm.service
> >
> > *●* slurm.service - LSB: slurm daemon management
> > Loaded: loaded (/etc/rc.d/init.d/slurm; bad; vendor preset: disabled)
> > Active: *failed* (Result: protocol) since Tue 2020-06-02 11:45:33 CEST; 3min 33s ago
> > Docs: man:systemd-sysv-generator(8)
> >
> > Jun 02 11:45:33 roos21.organ.su.se systemd[1]: Starting LSB: slurm daemon management...
> > Jun 02 11:45:33 roos21.organ.su.se slurm[18223]: starting slurmd: [OK]
> > Jun 02 11:45:33 roos21.organ.su.se systemd[1]: Can't open PID file /var/run/slurmctld.pid (yet?) after start: No such file or directory
> > Jun 02 11:45:33 roos21.organ.su.se systemd[1]: *Failed to start LSB: slurm daemon management.*
> > Jun 02 11:45:33 roos21.organ.su.se systemd[1]: *Unit slurm.service entered failed state.*
> > Jun 02 11:45:33 roos21.organ.su.se systemd[1]: *slurm.service failed.*
> >
> > The thing is that this is a computing node, not the master node, so
> > slurmctld is not installed. Why do I get this error?
> >
> > Many thanks, and my apologies for these rather simple questions. I am a
> > newbie on this.
> >
> > Best,
> >
> > Ferran
> >
> > --------------------------------------------------------------------------
> > *From:* slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of
> > Renata Maria Dart <ren...@slac.stanford.edu>
> > *Sent:* Friday, May 29, 2020 6:33:58 PM
> > *To:* ole.h.niel...@fysik.dtu.dk; Slurm User Community List
> > *Subject:* Re: [slurm-users] Problem with permisions. CentOS 7.8
> >
> > Hi, don't know if this might be your problem but I ran into an issue
> > on centos 7.8 where /var/run/munge was not being created at boottime
> > because I didn't have the munge user in the local password file.
> > I have the munge user in AD and once the system is up I can start munge
> > successfully, but AD wasn't available early enough during boot for the
> > munge startup to see it. I added these lines to the munge systemctl
> > file:
> >
> > PermissionsStartOnly=true
> > ExecStartPre=-/usr/bin/mkdir -m 0755 -p /var/run/munge
> > ExecStartPre=-/usr/bin/chown -R munge:munge /var/run/munge
> >
> > and my system now starts munge up fine during a reboot.
> >
> > Renata
> >
> > On Fri, 29 May 2020, Ole Holm Nielsen wrote:
> >
> >> Hi Ferran,
> >>
> >> When you have a CentOS 7 system with the EPEL repo enabled, and you have
> >> installed the munge RPM from EPEL, then things should be working correctly.
> >>
> >> Since systemctl tells you that Munge service didn't start correctly, then it
> >> seems to me that you have a problem in the general configuration of your CentOS
> >> 7 system. You should check /var/log/messages and "journalctl -xe" for munge
> >> errors. It is really hard for other people to guess what may be wrong in your
> >> system.
> >>
> >> My 2 cents worth: Maybe you could make a fresh CentOS 7.8 installation on a
> >> test system and install the Munge service (and nothing else) according to
> >> instructions in https://wiki.fysik.dtu.dk/niflheim/Slurm_installation. This
> >> *really* has got to work!
> >>
> >> /Ole
> >>
> >> On 29-05-2020 10:23, Ferran Planas Padros wrote:
> >>> Hello everyone,
> >>>
> >>> Here comes everything I've done.
> >>>
> >>> - About Ole's answer:
> >>>
> >>> Yes, we have slurm as the user to control munge. Following your comment, I
> >>> have changed the ownership of the munge files and tried to start munge as
> >>> munge user. However, it also failed.
> >>>
> >>> Also, I first installed munge from a repository. I've seen your suggestion of
> >>> installing from EPEL. So I uninstalled and installed again.
> >>> Same result.
> >>>
> >>> - About SELinux: It is disabled
> >>>
> >>> - The output of ps -ef | grep munge is:
> >>>
> >>> root534051530 10:18 pts/000:00:00 grep --color=auto *munge*
> >>>
> >>> - The output of munge -n is:
> >>>
> >>> Failed to access "/var/run/munge/munge.socket.2": No such file or directory
> >>>
> >>> - Same for unmunge
> >>>
> >>> - Output for sudo systemctl status --full munge
> >>>
> >>> *●* munge.service - MUNGE authentication service
> >>> Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; vendor preset: disabled)
> >>> Active: *failed* (Result: exit-code) since Fri 2020-05-29 10:15:52 CEST; 4min 18s ago
> >>> Docs: man:munged(8)
> >>> Process: 5333 ExecStart=/usr/sbin/munged *(code=exited, status=1/FAILURE)*
> >>>
> >>> May 29 10:15:52 roos21.organ.su.se systemd[1]: Starting MUNGE authentication service...
> >>> May 29 10:15:52 roos21.organ.su.se systemd[1]: *munge.service: control process exited, code=exited status=1*
> >>> May 29 10:15:52 roos21.organ.su.se systemd[1]: *Failed to start MUNGE authentication service.*
> >>> May 29 10:15:52 roos21.organ.su.se systemd[1]: *Unit munge.service entered failed state.*
> >>> May 29 10:15:52 roos21.organ.su.se systemd[1]: *munge.service failed.*
> >>>
> >>> - Regarding NTP, I get this message:
> >>>
> >>> Unable to talk to NTP daemon. Is it running?
> >>>
> >>> It is the same message I get in the nodes that DO work.
> >>> All nodes are in sync in time and date with the central node.
> >>>
> >>> ------------------------------------------------------------------------
> >>> *From:* slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Ole
> >>> Holm Nielsen <ole.h.niel...@fysik.dtu.dk>
> >>> *Sent:* Friday, May 29, 2020 9:56:10 AM
> >>> *To:* slurm-users@lists.schedmd.com
> >>> *Subject:* Re: [slurm-users] Problem with permisions. CentOS 7.8
> >>>
> >>> On 29-05-2020 08:46, Sudeep Narayan Banerjee wrote:
> >>>> also check:
> >>>> a) whether NTP has been setup and communicating with master node
> >>>> b) iptables may be flushed (iptables -L)
> >>>> c) SeLinux to disabled, to check :
> >>>> getenforce
> >>>> vim /etc/sysconfig/selinux
> >>>> (change SELINUX=enforcing to SELINUX=disabled and save the file and reboot)
> >>>
> >>> There is no reason to disable SELinux for running the Munge service.
> >>> It's a pretty bad idea to lower the security just for the sake of
> >>> convenience!
> >>>
> >>> /Ole
> >>>
> >>>> On Fri, May 29, 2020 at 12:08 PM Sudeep Narayan Banerjee
> >>>> <snbaner...@iitgn.ac.in> wrote:
> >>>>
> >>>> I have not checked on the CentOS7.8
> >>>> a) if /var/run/munge folder does not exist then please double check
> >>>> whether munge has been installed or not
> >>>> b) user root or sudo user to do
> >>>> ps -ef | grep munge
> >>>> kill -9 <PID> //where PID is the Process ID for munge (if the
> >>>> process is running at all); else
> >>>>
> >>>> which munged
> >>>> /etc/init.d/munge start
> >>>>
> >>>> please let me know the output of:
> >>>>
> >>>> $ munge -n
> >>>>
> >>>> $ munge -n | unmunge
> >>>>
> >>>> $ sudo systemctl status --full munge
> >>>>
> >>>> Thanks & Regards,
> >>>> Sudeep Narayan Banerjee
> >>>> System Analyst | Scientist B
> >>>> Indian Institute of Technology Gandhinagar
> >>>> Gujarat, INDIA
> >>>>
> >>>> On Fri, May 29, 2020 at 11:55 AM Bjørn-Helge Mevik
> >>>> <b.h.me...@usit.uio.no> wrote:
> >>>>
> >>>> Ferran Planas Padros <ferran.pad...@su.se> writes:
> >>>>
> >>>> > I run the command as slurm user, and the /var/log/munge
> >>>> > folder does belong to slurm.
> >>>>
> >>>> For security reasons, I strongly advise that you run munged as a
> >>>> separate user, which is unprivileged and not used for anything else.
> >>>>
> >>>> -- Regards,
> >>>> Bjørn-Helge Mevik, dr. scient,
> >>>> Department for Research Computing, University of Oslo