Hi,

Thanks everyone for your responses. I have also tested removing nodes from the cluster, and the same thing happens.
*To answer some of the previous questions.*

The "Node compute004 appears to have a different slurm.conf than the slurmctld" error comes up when I replace slurm.conf on all the machines, but it goes away once I restart slurmctld.

The Slurm version I'm running is 15.08.7.

I've included the slurm.conf rather than slurmdbd.conf. A sketch of the exact commands I run for the steps in my original message is at the bottom of this mail.

Cheers,
Jin

On Mon, Oct 23, 2017 at 8:25 AM Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk> wrote:
>
> Hi Jin,
>
> Your slurmctld.log says "Node compute004 appears to have a different
> slurm.conf than the slurmctld" etc. This will happen if you didn't
> copy the slurm.conf correctly to the nodes. Please correct this
> potential error.
>
> Also, please specify which version of Slurm you're running.
>
> /Ole
>
> On 10/22/2017 08:44 PM, JinSung Kang wrote:
> > I am having trouble with adding new nodes into a Slurm cluster without
> > killing the jobs that are currently running.
> >
> > Right now I
> >
> > 1. Update the slurm.conf and add a new node to it
> > 2. Copy the new slurm.conf to all the nodes
> > 3. Restart slurmd on all nodes
> > 4. Restart slurmctld
> >
> > But when I restart slurmctld, all the jobs that were running are
> > requeued, with "Begin Time" as the reason for not running. The newly
> > added node works perfectly fine.
> >
> > I've included the slurm.conf. I've also included the slurmctld.log
> > output from when I'm trying to add the new node.
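For reference, here is a rough sketch of the commands behind steps 1-4 above, assuming systemd-managed slurmd/slurmctld services, slurm.conf living in /etc/slurm/, and illustrative host names (compute001-005); none of these names or paths come from the original thread, so adjust them for your own setup:

    # 1. Add the new node to slurm.conf on the controller, e.g. (illustrative):
    #    NodeName=compute005 CPUs=16 RealMemory=64000 State=UNKNOWN
    #    PartitionName=batch Nodes=compute[001-005] Default=YES State=UP

    # 2. Copy the updated slurm.conf to every node (pdsh/clush would also work):
    NODELIST="compute001 compute002 compute003 compute004 compute005"
    for host in $NODELIST; do
        scp /etc/slurm/slurm.conf "${host}:/etc/slurm/slurm.conf"
    done

    # 3. Restart slurmd on all nodes:
    for host in $NODELIST; do
        ssh "$host" systemctl restart slurmd
    done

    # 4. Restart slurmctld on the controller:
    systemctl restart slurmctld

    # Check that the new node registered and that running jobs survived:
    sinfo -N -l
    squeue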
[Attachment: slurm.conf]