Hello,

I have some questions about adding nodes in configless mode. My Slurm version is 21.08.5. I have put the logs at the bottom of the message to keep it readable.

First, is "scontrol reconfigure" equivalent to "scontrol reconfig"?

Then, I see some strange behaviour when adding a node.

I have a healthy cluster with two nodes:

nico@control-node-0:~$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
COMPUTE*     up   infinite      2   idle hpc-node-[0-1]

My config file is:

nico@control-node-0:~$ cat /etc/slurm/slurm.conf
ClusterName=slurmtest
ControlMachine=control-node-0
SlurmUser=slurm
AuthType=auth/munge
StateSaveLocation=/var/slurm/
SlurmdSpoolDir=/var/slurm/
SlurmctldPidFile=/var/slurm/slurmctld.pid
SlurmdPidFile=/var/slurm/slurmd.pid
SlurmdLogFile=/var/slurm/slurmd.log
SlurmctldLogFile=/var/slurm/slurmctld.log
#SlurmctldParameters=enable_configless,cloud_dns
SlurmctldParameters=enable_configless
CommunicationParameters=NoAddrCache
ProctrackType=proctrack/linuxproc

NodeName=hpc-node-[0-1] Procs=2 State=UNKNOWN
PartitionName=COMPUTE Nodes=hpc-node-[0-1] Default=YES MaxTime=INFINITE State=UP

If I submit a job, it works:

nico@control-node-0:~$ srun -N 2 hostname
hpc-node-0
hpc-node-1

Now, I try to submit a job too large for my cluster:

nico@control-node-0:~$ srun -N 3 hostname
srun: Requested partition configuration not available now
srun: job 10 queued and waiting for resources

It's pending.

I add a new compute node to the config file, so NodeName becomes:

NodeName=hpc-node-[0-2] Procs=2 State=UNKNOWN

At this point DNS is fine: every node resolves every other node.

I send a TERM signal to slurmctld:

root@control-node-0:/# kill -TERM `cat /var/slurm/slurmctld.pid`

(See log 1 below.)

Then I restart it:

slurmctld -D -f /etc/slurm/slurm.conf

A few seconds later, my job fails (log 2) with:

srun: job 10 has been allocated resources
srun: error: fwd_tree_thread: can't find address for host hpc-node-2, check slurm.conf

Only the hostnames of hpc-node-0 and hpc-node-1 are printed.

I guess this is because slurm.conf has not been updated on the compute nodes, so they don't know about hpc-node-2 even though it is resolvable. Here is the NodeName section on the compute nodes:

root@hpc-node-0:/# cat /var/slurm/conf-cache/slurm.conf
NodeName=hpc-node-[0-1] Procs=2 State=UNKNOWN

But the node is visible to my slurmctld (because it was restarted after the SIGTERM, I think):

nico@control-node-0:~$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
COMPUTE*     up   infinite      3   idle hpc-node-[0-2]

And if I submit the job again, it works:

nico@control-node-0:~$ srun -N 3 hostname
hpc-node-0
hpc-node-2
hpc-node-1

This is despite the config not having been updated on the compute nodes:

root@hpc-node-0:/# cat /var/slurm/conf-cache/slurm.conf
NodeName=hpc-node-[0-1] Procs=2 State=UNKNOWN

Is this normal behaviour?

Note:

If I perform a "scontrol reconfig", the compute node configuration is updated:

root@hpc-node-0:/# cat /var/slurm/conf-cache/slurm.conf
NodeName=hpc-node-[0-2] Procs=2 State=UNKNOWN

I try a second scenario. I go back to the two-node configuration and submit a too-large job as above. I again add a third node to the config file, but this time, instead of sending a TERM signal, I run "scontrol reconfig":

root@control-node-0:/# scontrol reconfig
slurm_reconfigure error: Zero Bytes were transmitted or received

Nothing is updated.

I run the command a second time, and immediately the job fails with the same error as in log 2.

But this time, the configuration on the compute nodes is updated:

root@hpc-node-0:/# cat /var/slurm/conf-cache/slurm.conf
NodeName=hpc-node-[0-2] Procs=2 State=UNKNOWN

If I run the job a second time, it works:

nico@control-node-0:~$ srun -N 3 hostname
hpc-node-0
hpc-node-1
hpc-node-2

I think I'm missing something about adding new compute nodes. Could you tell me the best practice for adding compute nodes with a configless configuration?
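For reference, here is the sequence that empirically seems to work in my tests above. I don't know whether this is the intended procedure, so corrections are welcome:

```shell
#!/bin/sh
# Node-addition sequence that seems to work for me in configless mode.
# This is only my working assumption based on the tests above, not a
# documented procedure. Paths match my setup; adjust as needed.

# 1. On the controller, add the new node to /etc/slurm/slurm.conf,
#    e.g. change NodeName=hpc-node-[0-1] to NodeName=hpc-node-[0-2].

# 2. Restart slurmctld so it learns about the new node (in my setup
#    I run it in the foreground):
#      kill -TERM `cat /var/slurm/slurmctld.pid`
#      slurmctld -D -f /etc/slurm/slurm.conf

# 3. Push the new config to the slurmd conf-cache on the compute nodes:
scontrol reconfigure

# 4. Check that the new node shows up:
sinfo
```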

Thank you!

Kind regards,

==== LOGS =====

log 1
-----
[2022-01-17T02:45:06.809] Terminate signal (SIGINT or SIGTERM) received
[2022-01-17T02:45:06.854] Saving all slurm state
[2022-01-17T02:45:06.922] error: Configured MailProg is invalid
[2022-01-17T02:45:06.922] slurmctld version 21.08.5 started on cluster slurmtest
[2022-01-17T02:45:06.924] No memory enforcing mechanism configured.
[2022-01-17T02:45:06.931] Recovered state of 2 nodes
[2022-01-17T02:45:06.931] Recovered JobId=10 Assoc=0
[2022-01-17T02:45:06.931] Recovered information about 1 jobs
[2022-01-17T02:45:06.931] Recovered state of 0 reservations
[2022-01-17T02:45:06.931] read_slurm_conf: backup_controller not specified
[2022-01-17T02:45:06.931] Running as primary controller
[2022-01-17T02:45:06.931] No parameter for mcs plugin, default values set
[2022-01-17T02:45:06.931] mcs: MCSParameters = (null). ondemand set.
[2022-01-17T02:45:06.933] error: slurm_receive_msg [192.168.104.56:49698]: Zero Bytes were transmitted or received
[2022-01-17T02:45:10.004] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
[2022-01-17T02:45:10.010] error: Node hpc-node-1 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2022-01-17T02:45:10.010] error: Node hpc-node-0 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2022-01-17T02:45:36.927] sched/backfill: _start_job: Started JobId=10 in COMPUTE on hpc-node-[0-2]

log 2
------
srun: job 10 has been allocated resources
srun: error: fwd_tree_thread: can't find address for host hpc-node-2, check slurm.conf
srun: error: Task launch for StepId=10.0 failed on node hpc-node-2: Can't find an address, check slurm.conf
srun: error: Application launch failed: Can't find an address, check slurm.conf
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
hpc-node-0
hpc-node-1
srun: error: Timed out waiting for job step to complete

--
Nicolas Greneche
USPN
Support à la recherche / RSSI
https://www-magi.univ-paris13.fr
