Thanks for the quick reply.

Check if munge is working properly:

root@ecpsinf01:~# munge -n | ssh ecpsc10 unmunge
Warning: the ECDSA host key for 'ecpsc10' differs from the key for the IP address '128.178.242.136'
Offending key for IP in /root/.ssh/known_hosts:5
Matching host key in /root/.ssh/known_hosts:28
Are you sure you want to continue connecting (yes/no)? yes
STATUS:          Success (0)
ENCODE_HOST:     ecpsc10 (127.0.1.1)
ENCODE_TIME:     2021-11-16 16:57:56 +0100 (1637078276)
DECODE_TIME:     2021-11-16 16:58:10 +0100 (1637078290)
TTL:             300
CIPHER:          aes128 (4)
MAC:             sha256 (5)
ZIP:             none (0)
UID:             root (0)
GID:             root (0)
LENGTH:          0
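Since the credential decodes with STATUS: Success, munge itself looks healthy. If it were still suspect, the usual cross-checks are that both hosts share the same key and that their clocks agree (the credential above carries a 300-second TTL). A quick sanity check, assuming the default key path /etc/munge/munge.key on both machines:

root@ecpsinf01:~# cksum /etc/munge/munge.key
root@ecpsinf01:~# ssh ecpsc10 cksum /etc/munge/munge.key    # checksums must match
root@ecpsinf01:~# date; ssh ecpsc10 date                    # clocks should agree to within the TTL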
Check if SELinux is enforced:

Controller node:
root@ecpsinf01:~# getenforce
-bash: getenforce: command not found
root@ecpsinf01:~# sestatus
-bash: sestatus: command not found

Compute node:
root@ecpsc10:~# getenforce
Command 'getenforce' not found, but can be installed with:
apt install selinux-utils
root@ecpsc10:~# sestatus
Command 'sestatus' not found, but can be installed with:
apt install policycoreutils
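On Ubuntu, the absence of getenforce/sestatus almost always means SELinux is not installed at all (Ubuntu ships AppArmor by default). If you want to confirm without installing anything, a minimal check might look like this:

root@ecpsc10:~# test -d /sys/fs/selinux || echo "SELinux not active"   # pseudo-fs only exists when SELinux is loaded
root@ecpsc10:~# aa-status 2>/dev/null | head -n 1                      # AppArmor status, if the tools are present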
Check the Slurm log file (slurmd.log on the compute node):

[2021-11-16T16:19:54.646] debug: Log file re-opened
[2021-11-16T16:19:54.666] Message aggregation disabled
[2021-11-16T16:19:54.666] topology NONE plugin loaded
[2021-11-16T16:19:54.666] route default plugin loaded
[2021-11-16T16:19:54.667] CPU frequency setting not configured for this node
[2021-11-16T16:19:54.667] debug: Resource spec: No specialized cores configured by default on this node
[2021-11-16T16:19:54.667] debug: Resource spec: Reserved system memory limit not configured for this node
[2021-11-16T16:19:54.667] debug: Reading cgroup.conf file /etc/slurm/cgroup.conf
[2021-11-16T16:19:54.667] debug: Ignoring obsolete CgroupReleaseAgentDir option.
[2021-11-16T16:19:54.669] debug: Reading cgroup.conf file /etc/slurm/cgroup.conf
[2021-11-16T16:19:54.670] debug: Ignoring obsolete CgroupReleaseAgentDir option.
[2021-11-16T16:19:54.670] debug: task/cgroup: now constraining jobs allocated cores
[2021-11-16T16:19:54.670] debug: task/cgroup/memory: total:112428M allowed:100%(enforced), swap:0%(permissive), max:100%(112428M) max+swap:100%(224856M) min:30M kmem:100%(112428M enforced) min:30M swappiness:0(unset)
[2021-11-16T16:19:54.670] debug: task/cgroup: now constraining jobs allocated memory
[2021-11-16T16:19:54.670] debug: task/cgroup: now constraining jobs allocated devices
[2021-11-16T16:19:54.670] debug: task/cgroup: loaded
[2021-11-16T16:19:54.671] debug: Munge authentication plugin loaded
[2021-11-16T16:19:54.671] debug: spank: opening plugin stack /etc/slurm/plugstack.conf
[2021-11-16T16:19:54.671] Munge cryptographic signature plugin loaded
[2021-11-16T16:19:54.673] slurmd version 17.11.12 started
[2021-11-16T16:19:54.673] debug: Job accounting gather cgroup plugin loaded
[2021-11-16T16:19:54.674] debug: job_container none plugin loaded
[2021-11-16T16:19:54.674] debug: switch NONE plugin loaded
[2021-11-16T16:19:54.674] slurmd started on Tue, 16 Nov 2021 16:19:54 +0100
[2021-11-16T16:19:54.675] CPUs=16 Boards=1 Sockets=2 Cores=8 Threads=1 Memory=112428 TmpDisk=224253 Uptime=1911799 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2021-11-16T16:19:54.675] debug: AcctGatherEnergy NONE plugin loaded
[2021-11-16T16:19:54.675] debug: AcctGatherProfile NONE plugin loaded
[2021-11-16T16:19:54.675] debug: AcctGatherInterconnect NONE plugin loaded
[2021-11-16T16:19:54.676] debug: AcctGatherFilesystem NONE plugin loaded

Check if firewalld is enabled: no.
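For completeness: Ubuntu does not usually run firewalld, so ufw or raw iptables/nftables rules are the more likely culprits. A quick way to rule out packet filtering between controller and node might be the following; it assumes the default SlurmdPort of 6818 (the actual value, if set, is in slurm.conf) and that netcat is available:

root@ecpsc10:~# ufw status                  # expect "Status: inactive" if no firewall
root@ecpsc10:~# iptables -L -n | head       # look for non-empty INPUT rules
root@ecpsinf01:~# nc -zv ecpsc10 6818       # from the controller: is slurmd's port reachable?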
From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Hadrian Djohari <hx...@case.edu>
Reply to: Slurm User Community List <slurm-users@lists.schedmd.com>
Date: Tuesday, 16 November 2021 at 16:56
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] Unable to start slurmd service

There can be a few possibilities:
1. Check if munge is working properly. From the scheduler master, run "munge -n | ssh ecpsc10 unmunge"
2. Check if SELinux is enforced
3. Check if firewalld or a similar firewall is enabled
4. Check the logs /var/log/slurm/slurmctld.log or slurmd.log on the compute node

Best,

On Tue, Nov 16, 2021 at 10:12 AM Jaep Emmanuel <emmanuel.j...@epfl.ch> wrote:

Hi,

This might be a newbie question, since I'm new to Slurm. I'm trying to restart the slurmd service on one of our Ubuntu boxes. The slurmd.service unit is defined as:

[Unit]
Description=Slurm node daemon
After=network.target munge.service
ConditionPathExists=/etc/slurm/slurm.conf

[Service]
Type=forking
EnvironmentFile=-/etc/sysconfig/slurmd
ExecStart=/usr/sbin/slurmd -d /usr/sbin/slurmstepd $SLURMD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
PIDFile=/var/run/slurmd.pid
KillMode=process
LimitNOFILE=51200
LimitMEMLOCK=infinity
LimitSTACK=infinity

[Install]
WantedBy=multi-user.target

The service starts without issue (systemctl start slurmd.service). However, when checking the status of the service, I get a couple of error messages, but nothing alarming:

~# systemctl status slurmd.service
● slurmd.service - Slurm node daemon
   Loaded: loaded (/etc/systemd/system/slurmd.service; enabled; vendor preset: enabled)
   Active: active (running) since Tue 2021-11-16 15:58:01 CET; 50s ago
  Process: 2713019 ExecStart=/usr/sbin/slurmd -d /usr/sbin/slurmstepd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 2713021 (slurmd)
    Tasks: 1 (limit: 134845)
   Memory: 1.9M
   CGroup: /system.slice/slurmd.service
           └─2713021 /usr/sbin/slurmd -d /usr/sbin/slurmstepd

Nov 16 15:58:01 ecpsc10 systemd[1]: Starting Slurm node daemon...
Nov 16 15:58:01 ecpsc10 systemd[1]: slurmd.service: Can't open PID file /run/slurmd.pid (yet?) after start: Operation not pe>
Nov 16 15:58:01 ecpsc10 systemd[1]: Started Slurm node daemon.

Unfortunately, the node is still seen as down when I issue 'sinfo':

root@ecpsc10:~# sinfo
PARTITION     AVAIL  TIMELIMIT  NODES  STATE  NODELIST
Compute          up   infinite      2   idle  ecpsc[11-12]
Compute          up   infinite      1   down  ecpsc10
FastCompute*     up   infinite      2   idle  ecpsf[10-11]

When I get the details on this node, I see the following:

root@ecpsc10:~# scontrol show node ecpsc10
NodeName=ecpsc10 Arch=x86_64 CoresPerSocket=8
   CPUAlloc=0 CPUErr=0 CPUTot=16 CPULoad=0.00
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=ecpsc10 NodeHostName=ecpsc10 Version=17.11
   OS=Linux 5.8.0-43-generic #49~20.04.1-Ubuntu SMP Fri Feb 5 09:57:56 UTC 2021
   RealMemory=40195 AllocMem=0 FreeMem=4585 Sockets=2 Boards=1
   State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=Compute
   BootTime=2021-10-25T14:16:35 SlurmdStartTime=2021-11-16T15:58:01
   CfgTRES=cpu=16,mem=40195M,billing=16
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Node unexpectedly rebooted [slurm@2021-11-16T14:41:04]

From the Reason field, I gather the node was marked down because the machine unexpectedly rebooted. However, /etc/slurm/slurm.conf contains:

root@ecpsc10:~# cat /etc/slurm/slurm.conf | grep -i returntoservice
ReturnToService=2

So I'm quite puzzled as to why the node will not go back online. Any help would be greatly appreciated.

Best,
Emmanuel

--
Hadrian Djohari
Manager of Research Computing Services, [U]Tech
Case Western Reserve University
(W): 216-368-0395
(M): 216-798-7490
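A closing note on the ReturnToService puzzle above: ReturnToService is acted on by slurmctld, so the value that counts is the one in the controller's copy of slurm.conf (the grep shown was run on the compute node). If the two copies agree and the node still stays DOWN, clearing the stale state by hand is the usual workaround. A sketch, run on the controller and assuming the node is otherwise healthy:

root@ecpsinf01:~# grep -i returntoservice /etc/slurm/slurm.conf   # verify on the controller too
root@ecpsinf01:~# scontrol update NodeName=ecpsc10 State=RESUME
root@ecpsinf01:~# sinfo                                           # ecpsc10 should now report idle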