Hi,

Thanks everyone for your responses. I have also tested removing nodes from the cluster, and the same thing happens.
*To answer some of the previous questions.*

The "Node compute004 appears to have a different slurm.conf than the slurmctld" error comes up when I replace slurm.conf on all the machines, but it goes away once I restart slurmctld.

The Slurm version I'm running is 15.08.7.

I've included the slurm.conf rather than slurmdbd.conf. A sketch of the exact commands I run for the steps in my original message is at the bottom of this mail.

Cheers,
Jin

On Mon, Oct 23, 2017 at 8:25 AM Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk> wrote:
>
> Hi Jin,
>
> Your slurmctld.log says "Node compute004 appears to have a different
> slurm.conf than the slurmctld" etc. This will happen if you didn't
> copy the slurm.conf correctly to the nodes. Please correct this
> potential error.
>
> Also, please specify which version of Slurm you're running.
>
> /Ole
>
> On 10/22/2017 08:44 PM, JinSung Kang wrote:
> > I am having trouble with adding new nodes into a Slurm cluster without
> > killing the jobs that are currently running.
> >
> > Right now I
> >
> > 1. Update the slurm.conf and add a new node to it
> > 2. Copy the new slurm.conf to all the nodes
> > 3. Restart slurmd on all nodes
> > 4. Restart slurmctld
> >
> > But when I restart slurmctld, all the jobs that were running are
> > requeued, with "Begin Time" as the reason for not running. The newly
> > added node works perfectly fine.
> >
> > I've included the slurm.conf. I've also included the slurmctld.log
> > output from when I'm trying to add the new node.
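For reference, here is a rough sketch of the commands behind steps 1-4 above, assuming systemd-managed slurmd/slurmctld services, slurm.conf living in /etc/slurm/, and illustrative host names (compute001-005); none of these names or paths come from the original thread, so adjust them for your own setup:

    # 1. Add the new node to slurm.conf on the controller, e.g. (illustrative):
    #    NodeName=compute005 CPUs=16 RealMemory=64000 State=UNKNOWN
    #    PartitionName=batch Nodes=compute[001-005] Default=YES State=UP

    # 2. Copy the updated slurm.conf to every node (pdsh/clush would also work):
    NODELIST="compute001 compute002 compute003 compute004 compute005"
    for host in $NODELIST; do
        scp /etc/slurm/slurm.conf "${host}:/etc/slurm/slurm.conf"
    done

    # 3. Restart slurmd on all nodes:
    for host in $NODELIST; do
        ssh "$host" systemctl restart slurmd
    done

    # 4. Restart slurmctld on the controller:
    systemctl restart slurmctld

    # Check that the new node registered and that running jobs survived:
    sinfo -N -l
    squeue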
[Attachment: slurm.conf]