Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Roger Mason
Hello, "Groner, Rob" writes: > A quick test to see if it's a configuration error is to set > config_overrides in your slurm.conf and see if the node then responds > to scontrol update. Thanks to all who helped. It turned out that memory was the issue. I have now reseated the RAM in the offend

Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Groner, Rob
That output of slurmd -C is your answer.  Slurmd only sees 6GB of memory
and you are claiming it has 10GB.  I would run some memtests, look at
meminfo on the node, etc.  Maybe even check that the type/size of memory
in there is wha

Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Brian Andrus
That output of slurmd -C is your answer.  Slurmd only sees 6GB of memory
and you are claiming it has 10GB.  I would run some memtests, look at
meminfo on the node, etc.  Maybe even check that the type/size of memory
in there is what you think it is.

Brian Andrus

On 5/25/2023 7:30 AM, Roger Mas
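
Since the node in this thread runs FreeBSD, there is no /proc/meminfo to look at; a rough equivalent for checking how much memory the operating system actually sees might be (a sketch, not verified on this particular box):

    sysctl hw.physmem hw.realmem
    dmesg | grep -i memory

If those report roughly 6 GB on a machine that is supposed to hold 10 GB, the problem is in the hardware (a failed or unseated DIMM) rather than in Slurm.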

Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Roger Mason
Ole Holm Nielsen writes:
> 1. Is slurmd running on the node?

Yes.

> 2. What's the output of "slurmd -C" on the node?

NodeName=node012 CPUs=4 Boards=1 SocketsPerBoard=2 CoresPerSocket=2
ThreadsPerCore=1 RealMemory=6097

> 3. Define State=UP in slurm.conf instead of UNKNOWN

Will do.

> 4. Why h
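
If the 6097 MB reported by slurmd -C were actually all the node has, the matching node line in slurm.conf would need RealMemory set no higher than that figure, for example (a hypothetical line built from the values above):

    NodeName=node012 CPUs=4 Boards=1 SocketsPerBoard=2 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=6097 State=UP

In this thread the real fix turned out to be hardware (reseating the RAM), not lowering the configured value.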

Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Roger Mason
Hello,

Davide DelVento writes:
> Can you ssh into the node and check the actual availability of memory?
> Maybe there is a zombie process (or a healthy one with a memory leak
> bug) that's hogging all the memory?

This is what top shows:

last pid: 45688;  load averages:  0.00, 0.00, 0.00
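
To chase the zombie/leak idea on a FreeBSD node, a couple of commands that might help spot a process holding the memory (a sketch; run on the node itself):

    top -o res      # sort processes by resident memory size
    swapinfo        # see whether swap is being consumed as well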

Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Ole Holm Nielsen
On 5/25/23 15:23, Roger Mason wrote:
NodeName=node012 CoresPerSocket=2 CPUAlloc=0 CPUTot=4 CPULoad=N/A
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=node012 NodeHostName=node012
   RealMemory=10193 AllocMem=0 FreeMem=N/A Sockets=2 Boards=1
   State=UNKN
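
The mismatch driving the drain is visible by comparing what the controller believes with what the node itself reports, e.g. (node name taken from the thread):

    scontrol show node node012 | grep RealMemory   # controller: RealMemory=10193
    ssh node012 slurmd -C                          # node reports: RealMemory=6097

slurmctld drains a node whenever it registers with less memory than its configured RealMemory (unless config_overrides is in effect), which is exactly the situation shown here.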

Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Davide DelVento
Can you ssh into the node and check the actual availability of memory?
Maybe there is a zombie process (or a healthy one with a memory leak
bug) that's hogging all the memory?

On Thu, May 25, 2023 at 7:31 AM Roger Mason wrote:
> Hello,
>
> Doug Meyer writes:
>
> > Could also review the node log

Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Roger Mason
Hello,

Doug Meyer writes:
> Could also review the node log in /var/log/slurm/ .  Often sinfo -lR
> will tell you the cause, for example mem not matching the config.

REASON           USER        TIMESTAMP            STATE  NODELIST
Low RealMemory   slurm(468)  2023-05-25T09:26:59  drai
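
"Low RealMemory" is the reason slurmctld records when a node registers with less memory than the RealMemory value configured for it in slurm.conf. Even after the underlying cause is fixed, the drain flag normally has to be cleared by hand, e.g. (sketch):

    scontrol update nodename=node012 state=resume
    sinfo -R    # node012 should no longer be listed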

Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Roger Mason
Ole Holm Nielsen writes:
> On 5/25/23 13:59, Roger Mason wrote:
>> slurm 20.02.7 on FreeBSD.
>
> Uh, that's old!

Yes.  It is what is available in ports.

> What's the output of "scontrol show node node012"?

NodeName=node012 CoresPerSocket=2 CPUAlloc=0 CPUTot=4 CPULoad=N/A
   AvailableFeat

Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Doug Meyer
Could also review the node log in /var/log/slurm/ .  Often sinfo -lR will
tell you the cause, for example mem not matching the config.

Doug

On Thu, May 25, 2023 at 5:32 AM Ole Holm Nielsen wrote:
> On 5/25/23 13:59, Roger Mason wrote:
> > slurm 20.02.7 on FreeBSD.
>
> Uh, that's old!
>
> > I hav
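
If the slurmd log location is not obvious, its path is whatever SlurmdLogFile is set to; one way to check it from the controller (a sketch, the grep pattern is just an example):

    scontrol show config | grep -i SlurmdLogFile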

Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Ole Holm Nielsen
On 5/25/23 13:59, Roger Mason wrote:
 slurm 20.02.7 on FreeBSD.

Uh, that's old!

 I have a couple of nodes stuck in the drain state.  I have tried
 scontrol update nodename=node012 state=down reason="stuck in drain state"
 scontrol update nodename=node012 state=resume
 without success.  I then tr

[slurm-users] Nodes stuck in drain state

2023-05-25 Thread Roger Mason
Hello,

slurm 20.02.7 on FreeBSD.

I have a couple of nodes stuck in the drain state.  I have tried

scontrol update nodename=node012 state=down reason="stuck in drain state"
scontrol update nodename=node012 state=resume

without success.  I then tried

/usr/local/sbin/slurmctld -c
scontrol update
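
For reference, a drained node whose underlying problem is still present (here: the node registering less memory than configured) will simply drain again at its next registration, which is why state=resume appears not to work. A sketch of the usual check-then-clear order:

    sinfo -R                                      # read the recorded Reason first
    scontrol show node node012 | grep -i reason
    # fix the cause (hardware or slurm.conf), then:
    scontrol update nodename=node012 state=resume

Cold-starting slurmctld with -c discards all saved job and node state, so clearing the drain reason is normally preferable.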

[slurm-users] Nodes stuck in drain state and sending Invalid Argument every second

2020-02-06 Thread Dean Schulze
I moved two nodes to another controller and the two nodes will not come out of the drain state now. I've rebooted the hosts but they are still stuck in the drain state. There is nothing in the location given for saving state so I can't understand why a reboot doesn't clear this. Here's the node
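
The drain flag lives in the controller's saved state (StateSaveLocation), not on the compute hosts themselves, which is why rebooting the nodes does not clear it. A sketch of what to check on the new controller (node names are placeholders):

    scontrol show config | grep -i StateSaveLocation
    scontrol show node <nodename> | grep -i reason
    scontrol update nodename=<nodename> state=resume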