Oh, also ensure that DNS is working properly on the node. It could be
that the node isn't able to map the master's hostname to its IP.
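A quick way to check that from the node itself (the name "slurm-ctl" below is
just a placeholder; use whatever hostname your slurm.conf points at, and note
the config path varies between installs):

    # does the controller's hostname resolve on this node?
    getent hosts slurm-ctl
    # compare against what the node's slurm.conf expects
    # (newer Slurm uses SlurmctldHost, older versions ControlMachine)
    grep -Ei 'SlurmctldHost|ControlMachine' /etc/slurm/slurm.conf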
Brian Andrus
On 6/4/2021 9:31 AM, Herc Silverstein wrote:
> Hi,
> The slurmctld.log shows (for this node):
> ...
> [2021-05-25T00:12:27.481] sched: Allocate JobId=3402729
> NodeList=gpu-t4-4x-ondemand-44 #CPUs=1 Partition=gpu-t4-4x-ondemand
Sounds like a firewall issue.
When you log on to the 'down' node, can you run 'sinfo' or 'squeue' there?
Also, verify munge is configured/running properly on the node.
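A few quick checks from the node along those lines ("controller" is a
placeholder, and 6817 is only the default SlurmctldPort; adjust if your
slurm.conf overrides it):

    # is the munge daemon running on the node?
    systemctl status munge
    # can a credential generated here be decoded on the controller?
    munge -n | ssh controller unmunge
    # can the node reach slurmctld's port through the firewall?
    nc -zv controller 6817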
Brian Andrus
On 6/4/2021 9:31 AM, Herc Silverstein wrote:
> Hi,
> The slurmctld.log shows (for this node):
> ...
> [2021-05-25T00:12:27.481] sched: Allocate JobId=3402729
> NodeList=gpu-t4-4x-ondemand-44 #CPUs=1 Partition=gpu-t4-4x-ondemand
Hi,
The slurmctld.log shows (for this node):
...
[2021-05-25T00:12:27.481] sched: Allocate JobId=3402729
NodeList=gpu-t4-4x-ondemand-44 #CPUs=1 Partition=gpu-t4-4x-ondemand
[2021-05-25T00:12:27.482] sched: Allocate JobId=3402730
NodeList=gpu-t4-4x-ondemand-44 #CPUs=1 Partition=gpu-t4-4x-ondemand
On 5/19/21 9:15 pm, Herc Silverstein wrote:
> Does anyone have an idea of what might be going on?
To add to the other suggestions, I would say that checking the slurmctld
and slurmd logs to see what they report as being wrong is a good place to start.
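If you're not sure where those logs are written on your setup, something like
the following should point you at them (paths and unit names vary between
installs, so treat these as examples):

    # on the controller: where do the daemons log?
    scontrol show config | grep -i logfile
    # on the node, if slurmd logs to the journal instead of a file
    journalctl -u slurmd --since today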
Best of luck,
Chris
--
Chris Samuel : http
The SLURM controller AND all the compute nodes need to know who is in
the cluster. If you want to add a node, or a node changes its IP address, you
need to let all the nodes know about it, which, for me, usually means
restarting slurmd on the compute nodes.
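For what it's worth, that usually looks something like this for me (assuming
systemd-managed daemons; service names and the need for sudo may differ on
your systems):

    # after the updated slurm.conf has been copied to every node:
    sudo systemctl restart slurmctld    # on the controller
    sudo systemctl restart slurmd       # on each compute node
    # for smaller slurm.conf changes, a reconfigure may be enough:
    scontrol reconfigure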
I just say this because I get caught by th
Does it tell you the reason for it being down?
sinfo -R
I have seen where a node comes up, but the amount of memory slurmd sees
is a little less than what was configured in slurm.conf.
You should always set aside some of the memory when defining it in
slurm.conf so you have memory for the operating system.
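A quick way to compare the two (the node name below is just an example taken
from earlier in this thread):

    # what slurmd actually detects on the node
    slurmd -C
    # what the controller currently believes about the node
    scontrol show node gpu-t4-4x-ondemand-44 | grep -i memory

If the detected value is below the RealMemory configured in slurm.conf, the
node can get drained with a low-memory reason, so the usual fix is to set
RealMemory a bit below what slurmd -C reports.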
We had a situation recently where a desktop was turned off for a week. When
we brought it back online (in a different part of the network with a different
IP), everything came up fine (slurmd and munge).
But it kept going into DOWN* for no apparent reason (neither daemon-wise nor
log-wise).
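In case it helps anyone hitting the same thing, inspecting and clearing that
state generally looks like this (the node name is a placeholder):

    # why does the controller think the node is down?
    scontrol show node somenode | grep -Ei 'state|reason'
    # once the underlying cause is fixed, put it back in service
    scontrol update NodeName=somenode State=RESUME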
As p
Hi,
We have a cluster (in Google GCP) which has a few partitions set up to
auto-scale, but one partition is set up not to autoscale. The desired
state is for all of the nodes in this non-autoscaled partition
(SuspendExcParts=gpu-t4-4x-ondemand) to continue running uninterrupted.
However, we