Hello,
"Groner, Rob" writes:
> A quick test to see if it's a configuration error is to set
> config_overrides in your slurm.conf and see if the node then responds
> to scontrol update.
Thanks to all who helped. It turned out that memory was the issue. I
have now reseated the RAM in the offending node.
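For anyone who does want the override route instead: as far as I can tell
the knob Rob mentions is the config_overrides flag of SlurmdParameters in
slurm.conf (it tells slurmctld to trust the configured node definition
rather than what slurmd reports), roughly
SlurmdParameters=config_overrides
followed by restarting the daemons. Check the slurm.conf man page for your
version before relying on it.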
That output of slurmd -C is your answer.
Slurmd only sees 6GB of memory and you are claiming it has 10GB.
I would run some memtests, look at meminfo on the node, etc.
Maybe even check that the type/size of memory in there is what you think
it is.
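For example (assuming the node is FreeBSD like the rest of this thread,
where /proc/meminfo is not available), something like
sysctl hw.physmem hw.realmem    # memory the OS actually sees, in bytes
slurmd -C                       # what slurmd will report (RealMemory is in MB)
would show side by side what the OS sees and what slurmd detects. If
hw.physmem is already short of 10GB, the problem is below Slurm.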
Brian Andrus
On 5/25/2023 7:30 AM, Roger Mason wrote:
Ole Holm Nielsen writes:
> 1. Is slurmd running on the node?
Yes.
> 2. What's the output of "slurmd -C" on the node?
NodeName=node012 CPUs=4 Boards=1 SocketsPerBoard=2 CoresPerSocket=2
ThreadsPerCore=1 RealMemory=6097
> 3. Define State=UP in slurm.conf instead of UNKNOWN
Will do; see the example node line below.
> 4. Why h
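Based on that slurmd -C output, the node line in slurm.conf would
presumably need to look something like the following until the missing
memory is recovered (RealMemory matching what slurmd actually detects, and
State=UP as suggested):
NodeName=node012 CPUs=4 Boards=1 SocketsPerBoard=2 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=6097 State=UP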
Hello,
Davide DelVento writes:
> Can you ssh into the node and check the actual availability of memory?
> Maybe there is a zombie process (or a healthy one with a memory leak
> bug) that's hogging all the memory?
This is what top shows:
last pid: 45688; load averages: 0.00, 0.00, 0.00
On 5/25/23 15:23, Roger Mason wrote:
NodeName=node012 CoresPerSocket=2
CPUAlloc=0 CPUTot=4 CPULoad=N/A
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=node012 NodeHostName=node012
RealMemory=10193 AllocMem=0 FreeMem=N/A Sockets=2 Boards=1
State=UNKN
Can you ssh into the node and check the actual availability of memory?
Maybe there is a zombie process (or a healthy one with a memory leak bug)
that's hogging all the memory?
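On FreeBSD (which, from elsewhere in the thread, is what these nodes run) a
quick look could be something like
top -o res      # processes sorted by resident memory size
swapinfo        # swap usage, in case something is paging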
On Thu, May 25, 2023 at 7:31 AM Roger Mason wrote:
> Hello,
>
> Doug Meyer writes:
>
> > Could also review the node log
Hello,
Doug Meyer writes:
> Could also review the node log in /var/log/slurm/ . Often sinfo -lR will tell
> you the cause, for example mem not matching the config.
>
REASON USER TIMESTAMP STATE NODELIST
Low RealMemory slurm(468) 2023-05-25T09:26:59 drai
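For completeness: once the mismatch is dealt with (either by lowering
RealMemory in slurm.conf to what slurmd -C reports, or by fixing the RAM),
the node still has to be put back in service from the controller,
something like
scontrol reconfigure                            # only if slurm.conf was edited
scontrol update nodename=node012 state=resume   # clear the drain flag
otherwise it will sit in drain with the old reason attached.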
Ole Holm Nielsen writes:
> On 5/25/23 13:59, Roger Mason wrote:
>> slurm 20.02.7 on FreeBSD.
>
> Uh, that's old!
Yes. It is what is available in ports.
> What's the output of "scontrol show node node012"?
NodeName=node012 CoresPerSocket=2
CPUAlloc=0 CPUTot=4 CPULoad=N/A
AvailableFeat
Could also review the node log in /var/log/slurm/ . Often sinfo -lR will
tell you the cause, for example mem not matching the config.
Doug
On Thu, May 25, 2023 at 5:32 AM Ole Holm Nielsen
wrote:
> On 5/25/23 13:59, Roger Mason wrote:
> > slurm 20.02.7 on FreeBSD.
>
> Uh, that's old!
>
> > I hav
On 5/25/23 13:59, Roger Mason wrote:
slurm 20.02.7 on FreeBSD.
Uh, that's old!
I have a couple of nodes stuck in the drain state. I have tried
scontrol update nodename=node012 state=down reason="stuck in drain state"
scontrol update nodename=node012 state=resume
without success.
I then tr
Hello,
slurm 20.02.7 on FreeBSD.
I have a couple of nodes stuck in the drain state. I have tried
scontrol update nodename=node012 state=down reason="stuck in drain state"
scontrol update nodename=node012 state=resume
without success.
I then tried
/usr/local/sbin/slurmctld -c
scontrol update
I moved two nodes to another controller and the two nodes will not come out
of the drain state now. I've rebooted the hosts but they are still stuck
in the drain state. There is nothing in the location given for saving
state so I can't understand why a reboot doesn't clear this.
Here's the node
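As for why a reboot does not clear it: the drain flag is not stored on the
compute node at all. slurmctld keeps node state (including the drain flag
and its reason) in memory and persists it to StateSaveLocation on the
controller, so it survives node reboots and has to be cleared from the
controller, along the lines of
scontrol show node node012 | grep -Ei 'state|reason'
scontrol update nodename=node012 state=resume
and, if the reason is a configuration mismatch such as Low RealMemory, the
node will simply be drained again at its next registration until the
mismatch itself is fixed.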