There are a few possibilities to check (example commands below):

1. Check that munge is working properly. From the scheduler master, run
   "munge -n | ssh ecpsc10 unmunge".
2. Check whether SELinux is enforcing.
3. Check whether firewalld or a similar firewall is enabled.
4. Check the logs: /var/log/slurm/slurmctld.log on the controller and
   slurmd.log on the compute node.
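Roughly, something like this (a sketch, not verified on your setup; getenforce only applies if SELinux is installed, and on Ubuntu the firewall is usually ufw rather than firewalld):

# 1) munge round trip from the controller to the compute node
munge -n | ssh ecpsc10 unmunge

# 2) SELinux mode (Permissive or Disabled is fine)
getenforce

# 3) firewall state
systemctl status firewalld    # or on Ubuntu: ufw status

# 4) recent slurmd log entries on the compute node
tail -n 50 /var/log/slurm/slurmd.log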
Best,

On Tue, Nov 16, 2021 at 10:12 AM Jaep Emmanuel <emmanuel.j...@epfl.ch> wrote:

> Hi,
>
> It might be a newbie question since I'm new to Slurm.
> I'm trying to restart the slurmd service on one of our Ubuntu boxes.
>
> The slurmd.service is defined by:
>
> [Unit]
> Description=Slurm node daemon
> After=network.target munge.service
> ConditionPathExists=/etc/slurm/slurm.conf
>
> [Service]
> Type=forking
> EnvironmentFile=-/etc/sysconfig/slurmd
> ExecStart=/usr/sbin/slurmd -d /usr/sbin/slurmstepd $SLURMD_OPTIONS
> ExecReload=/bin/kill -HUP $MAINPID
> PIDFile=/var/run/slurmd.pid
> KillMode=process
> LimitNOFILE=51200
> LimitMEMLOCK=infinity
> LimitSTACK=infinity
>
> [Install]
> WantedBy=multi-user.target
>
> The service starts without issue (systemctl start slurmd.service).
> However, when checking the status of the service, I get a couple of
> error messages, but nothing alarming:
>
> ~# systemctl status slurmd.service
> ● slurmd.service - Slurm node daemon
>    Loaded: loaded (/etc/systemd/system/slurmd.service; enabled; vendor preset: enabled)
>    Active: active (running) since Tue 2021-11-16 15:58:01 CET; 50s ago
>   Process: 2713019 ExecStart=/usr/sbin/slurmd -d /usr/sbin/slurmstepd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
>  Main PID: 2713021 (slurmd)
>     Tasks: 1 (limit: 134845)
>    Memory: 1.9M
>    CGroup: /system.slice/slurmd.service
>            └─2713021 /usr/sbin/slurmd -d /usr/sbin/slurmstepd
>
> Nov 16 15:58:01 ecpsc10 systemd[1]: Starting Slurm node daemon...
> Nov 16 15:58:01 ecpsc10 systemd[1]: slurmd.service: Can't open PID file /run/slurmd.pid (yet?) after start: Operation not pe>
> Nov 16 15:58:01 ecpsc10 systemd[1]: Started Slurm node daemon.
>
> Unfortunately, the node is still seen as down when I issue a 'sinfo':
>
> root@ecpsc10:~# sinfo
> PARTITION    AVAIL  TIMELIMIT  NODES  STATE  NODELIST
> Compute         up   infinite      2   idle  ecpsc[11-12]
> Compute         up   infinite      1   down  ecpsc10
> FastCompute*    up   infinite      2   idle  ecpsf[10-11]
>
> When I get the details on this node, I get the following:
>
> root@ecpsc10:~# scontrol show node ecpsc10
> NodeName=ecpsc10 Arch=x86_64 CoresPerSocket=8
>    CPUAlloc=0 CPUErr=0 CPUTot=16 CPULoad=0.00
>    AvailableFeatures=(null)
>    ActiveFeatures=(null)
>    Gres=(null)
>    NodeAddr=ecpsc10 NodeHostName=ecpsc10 Version=17.11
>    OS=Linux 5.8.0-43-generic #49~20.04.1-Ubuntu SMP Fri Feb 5 09:57:56 UTC 2021
>    RealMemory=40195 AllocMem=0 FreeMem=4585 Sockets=2 Boards=1
>    State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>    Partitions=Compute
>    BootTime=2021-10-25T14:16:35 SlurmdStartTime=2021-11-16T15:58:01
>    CfgTRES=cpu=16,mem=40195M,billing=16
>    AllocTRES=
>    CapWatts=n/a
>    CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
>    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>    Reason=Node unexpectedly rebooted [slurm@2021-11-16T14:41:04]
>
> From the reason, I gather that the daemon won't re-register the node
> because the machine was rebooted.
> However, /etc/slurm/slurm.conf looks like:
>
> root@ecpsc10:~# cat /etc/slurm/slurm.conf | grep -i returntoservice
> ReturnToService=2
>
> So I'm quite puzzled as to why the node will not go back online.
>
> Any help will be greatly appreciated.
>
> Best,
>
> Emmanuel

--
Hadrian Djohari
Manager of Research Computing Services, [U]Tech
Case Western Reserve University
(W): 216-368-0395
(M): 216-798-7490