And we have ignition - thank you very much! :-)
On Mon, May 11, 2020 at 8:44 PM Alex Chekholko <a...@calicolabs.com> wrote:

> Any time a node goes into DRAIN state you need to manually intervene and
> put it back into service.
>
> scontrol update nodename=ip-172-31-80-232 state=resume
>
> On Mon, May 11, 2020 at 11:40 AM Joakim Hove <joakim.h...@gmail.com>
> wrote:
>
>>> You’re on the right track with the DRAIN state. The more specific answer
>>> is in the “Reason=” description on the last line.
>>>
>>> It looks like your node has less memory than what you’ve defined for the
>>> node in slurm.conf
>>
>> Thank you; that sounded meaningful to me. My slurm.conf file had
>> RealMemory=983 whereas "slurmd -C" showed "RealMemory=978" - so you are
>> right; the actual node had less available memory than what I had
>> configured in slurm.conf - I guess the difference comes from slightly
>> different AWS nodes? Anyway, I updated slurm.conf with "RealMemory=512" -
>> i.e. a wide margin below what the node actually has. After restarting
>> slurmctld / slurmd I now get:
>>
>> ubuntu@ip-172-31-80-232:~/opm-portal/aws$ scontrol show node
>> NodeName=ip-172-31-80-232 Arch=x86_64 CoresPerSocket=1
>>    CPUAlloc=0 CPUErr=0 CPUTot=1 CPULoad=0.00
>>    AvailableFeatures=(null)
>>    ActiveFeatures=(null)
>>    Gres=(null)
>>    NodeAddr=ip-172-31-80-232 NodeHostName=ip-172-31-80-232 Version=17.11
>>    OS=Linux 5.3.0-1017-aws #18~18.04.1-Ubuntu SMP Wed Apr 8 15:12:16 UTC 2020
>>    RealMemory=512 AllocMem=0 FreeMem=254 Sockets=1 Boards=1
>>    State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>>    Partitions=debug
>>    BootTime=2020-05-11T17:02:15 SlurmdStartTime=2020-05-11T18:29:30
>>    CfgTRES=cpu=1,mem=512M,billing=1
>>    AllocTRES=
>>    CapWatts=n/a
>>    CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
>>    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>>    Reason=Low RealMemory [root@2020-05-11T16:20:02]
>>
>> I.e. slurm has recognized the new memory setting, but the state is still
>> "IDLE+DRAIN" - and no jobs start running :-(
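
For anyone landing on this thread with the same symptom: a node drained with
"Reason=Low RealMemory" does not return to service on its own after the
configuration is corrected - the DRAIN flag has to be cleared by hand, which
is what finally got things running here. A minimal sketch of the full
sequence, assuming the node name from this thread and that you restart the
daemons after editing slurm.conf (as the poster did):

    # On the compute node: check how much memory slurmd actually detects
    slurmd -C

    # In slurm.conf: set RealMemory at or below the detected value,
    # then restart slurmctld and slurmd so the new value is picked up

    # Clear the lingering DRAIN flag so the node can accept jobs again
    scontrol update nodename=ip-172-31-80-232 state=resume

    # Confirm the node is back to State=IDLE
    scontrol show node ip-172-31-80-232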