Hi Jin,
I think I always do your steps 3 and 4 in the opposite order: restart
slurmctld first, then slurmd on the nodes:
> 3. Restart the slurmd on all nodes
> 4. Restart the slurmctld
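The reversed order can be sketched like this (the compute00X hostnames are made up, and the `run` wrapper only echoes each command so the ordering is visible without touching a live cluster; drop it and substitute your own node list or a fan-out tool such as pdsh/clush for real use):

```shell
# Hypothetical node list -- replace with your own.
NODES="compute001 compute002 compute003 compute004"

run() {
  # dry-run wrapper: print the command instead of executing it
  echo "would run: $*"
}

# restart the controller first...
run systemctl restart slurmctld
# ...then slurmd on each node
for n in $NODES; do
  run ssh "$n" systemctl restart slurmd
done
```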
Since you run a very old Slurm 15.08, perhaps you should upgrade 15.08
-> 16.05 -> 17.02. Soon there will be a 17.11. FYI: I wrote some notes
about upgrading:
https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-slurm
/Ole
On 10/23/2017 02:55 PM, JinSung Kang wrote:
Hi
Thanks everyone for your responses. I have also tested removing nodes
from the cluster, and the same thing happens.
*To answer some of the previous questions.*
The "Node compute004 appears to have a different slurm.conf than the
slurmctld" error comes up when I replace slurm.conf on all the machines,
but it goes away when I restart slurmctld.
The Slurm version I'm running is 15.08.7.
I've included the slurm.conf rather than slurmdbd.conf.
Cheers,
Jin
On Mon, Oct 23, 2017 at 8:25 AM Ole Holm Nielsen
<[email protected]> wrote:
Hi Jin,
Your slurmctld.log says "Node compute004 appears to have a different
slurm.conf than the slurmctld" etc. This will happen if you didn't copy
slurm.conf correctly to the nodes. Please correct this potential error.
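One quick way to check for that mismatch is to compare checksums of the controller's slurm.conf against each node's copy. A minimal sketch (the ssh/scp fan-out is left to your own tooling, and /etc/slurm/slurm.conf is a common default path that may differ on your install):

```shell
# Compare two copies of slurm.conf by md5 checksum; succeeds only
# when the files are byte-identical.
same_conf() {
  [ "$(md5sum "$1" | awk '{print $1}')" = "$(md5sum "$2" | awk '{print $1}')" ]
}

# Example usage after fetching a node's copy (path is an assumption):
#   scp compute004:/etc/slurm/slurm.conf /tmp/compute004.conf
#   same_conf /etc/slurm/slurm.conf /tmp/compute004.conf && echo OK
```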
Also, please specify which version of Slurm you're running.
/Ole
On 10/22/2017 08:44 PM, JinSung Kang wrote:
> I am having trouble adding new nodes to a Slurm cluster without
> killing the jobs that are currently running.
>
> Right now I
>
> 1. Update the slurm.conf and add a new node to it
> 2. Copy new slurm.conf to all the nodes,
> 3. Restart the slurmd on all nodes
> 4. Restart the slurmctld
>
> But when I restart slurmctld, all the jobs that were currently running
> are requeued, with (Begin Time) as the reason for not running. The newly
> added node works perfectly fine.
>
> I've included the slurm.conf. I've also included slurmctld.log output
> when I'm trying to add the new node.