Hello,

I have some questions about adding nodes in configless mode. My Slurm version is 21.08.5. I have put the logs at the bottom of the message to keep it readable.

First, is "scontrol reconfigure" equivalent to "scontrol reconfig"?

Then, I see some strange behaviour when adding a node.

I have a healthy cluster with two nodes:

nico@control-node-0:~$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
COMPUTE*     up   infinite      2   idle hpc-node-[0-1]

My config file is:

nico@control-node-0:~$ cat /etc/slurm/slurm.conf
ClusterName=slurmtest
ControlMachine=control-node-0
SlurmUser=slurm
AuthType=auth/munge
StateSaveLocation=/var/slurm/
SlurmdSpoolDir=/var/slurm/
SlurmctldPidFile=/var/slurm/slurmctld.pid
SlurmdPidFile=/var/slurm/slurmd.pid
SlurmdLogFile=/var/slurm/slurmd.log
SlurmctldLogFile=/var/slurm/slurmctld.log
#SlurmctldParameters=enable_configless,cloud_dns
SlurmctldParameters=enable_configless
CommunicationParameters=NoAddrCache
ProctrackType=proctrack/linuxproc

NodeName=hpc-node-[0-1] Procs=2 State=UNKNOWN
PartitionName=COMPUTE Nodes=hpc-node-[0-1] Default=YES MaxTime=INFINITE State=UP

If I submit a job, it works:

nico@control-node-0:~$ srun -N 2 hostname
hpc-node-0
hpc-node-1

Now, I try to submit a job too large for my cluster:

nico@control-node-0:~$ srun -N 3 hostname
srun: Requested partition configuration not available now
srun: job 10 queued and waiting for resources

It's pending.

I add a new compute node to the config file, so NodeName becomes:

NodeName=hpc-node-[0-2] Procs=2 State=UNKNOWN

At this point DNS is fine: every node resolves every other node.

I send a TERM signal to slurmctld:

root@control-node-0:/# kill -TERM `cat /var/slurm/slurmctld.pid`

(See log 1 below.)

Then I restart it:

slurmctld -D -f /etc/slurm/slurm.conf

A few seconds later, my job fails (log 2) with:

srun: job 10 has been allocated resources
srun: error: fwd_tree_thread: can't find address for host hpc-node-2, check slurm.conf

Only the hostnames of hpc-node-0 and hpc-node-1 are printed.

I guess this is because slurm.conf has not been updated on the compute nodes, so they don't know about hpc-node-2 even though it is resolvable. Here is the NodeName section on the compute nodes:

root@hpc-node-0:/# cat /var/slurm/conf-cache/slurm.conf
NodeName=hpc-node-[0-1] Procs=2 State=UNKNOWN

But the node is visible to my slurmctld (because it was restarted after the SIGTERM, I think):

nico@control-node-0:~$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
COMPUTE*     up   infinite      3   idle hpc-node-[0-2]

And if I submit the job again, it works:

nico@control-node-0:~$ srun -N 3 hostname
hpc-node-0
hpc-node-2
hpc-node-1

This is despite the config not having been updated on the compute nodes:

root@hpc-node-0:/# cat /var/slurm/conf-cache/slurm.conf
NodeName=hpc-node-[0-1] Procs=2 State=UNKNOWN

Is this normal behaviour?

Note:

If I perform a "scontrol reconfig", the compute node configuration is updated:

root@hpc-node-0:/# cat /var/slurm/conf-cache/slurm.conf
NodeName=hpc-node-[0-2] Procs=2 State=UNKNOWN

I try a second scenario. I go back to the two-node configuration and submit a too-large job as above. I again add a third node to the config file, but this time, instead of sending a TERM signal, I run "scontrol reconfig":

root@control-node-0:/# scontrol reconfig
slurm_reconfigure error: Zero Bytes were transmitted or received

Nothing is updated.

I run the command a second time, and immediately the job fails with the same error as in log 2.

But this time, the configuration on the compute nodes is updated:

root@hpc-node-0:/# cat /var/slurm/conf-cache/slurm.conf
NodeName=hpc-node-[0-2] Procs=2 State=UNKNOWN

If I run the job a second time, it works:

nico@control-node-0:~$ srun -N 3 hostname
hpc-node-0
hpc-node-1
hpc-node-2

I think I'm missing something about adding new compute nodes. Could you tell me the best practice for adding compute nodes with a configless configuration?
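For reference, here is the sequence that empirically seems to work in my tests above. I don't know whether this is the intended procedure, so corrections are welcome:

```shell
#!/bin/sh
# Node-addition sequence that seems to work for me in configless mode.
# This is only my working assumption based on the tests above, not a
# documented procedure. Paths match my setup; adjust as needed.

# 1. On the controller, add the new node to /etc/slurm/slurm.conf,
#    e.g. change NodeName=hpc-node-[0-1] to NodeName=hpc-node-[0-2].

# 2. Restart slurmctld so it learns about the new node (in my setup
#    I run it in the foreground):
#      kill -TERM `cat /var/slurm/slurmctld.pid`
#      slurmctld -D -f /etc/slurm/slurm.conf

# 3. Push the new config to the slurmd conf-cache on the compute nodes:
scontrol reconfigure

# 4. Check that the new node shows up:
sinfo
```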

Thank you!

Kind regards,

==== LOGS =====

log 1
-----
[2022-01-17T02:45:06.809] Terminate signal (SIGINT or SIGTERM) received
[2022-01-17T02:45:06.854] Saving all slurm state
[2022-01-17T02:45:06.922] error: Configured MailProg is invalid
[2022-01-17T02:45:06.922] slurmctld version 21.08.5 started on cluster slurmtest
[2022-01-17T02:45:06.924] No memory enforcing mechanism configured.
[2022-01-17T02:45:06.931] Recovered state of 2 nodes
[2022-01-17T02:45:06.931] Recovered JobId=10 Assoc=0
[2022-01-17T02:45:06.931] Recovered information about 1 jobs
[2022-01-17T02:45:06.931] Recovered state of 0 reservations
[2022-01-17T02:45:06.931] read_slurm_conf: backup_controller not specified
[2022-01-17T02:45:06.931] Running as primary controller
[2022-01-17T02:45:06.931] No parameter for mcs plugin, default values set
[2022-01-17T02:45:06.931] mcs: MCSParameters = (null). ondemand set.
[2022-01-17T02:45:06.933] error: slurm_receive_msg [192.168.104.56:49698]: Zero Bytes were transmitted or received
[2022-01-17T02:45:10.004] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
[2022-01-17T02:45:10.010] error: Node hpc-node-1 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2022-01-17T02:45:10.010] error: Node hpc-node-0 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2022-01-17T02:45:36.927] sched/backfill: _start_job: Started JobId=10 in COMPUTE on hpc-node-[0-2]

log 2
------
srun: job 10 has been allocated resources
srun: error: fwd_tree_thread: can't find address for host hpc-node-2, check slurm.conf
srun: error: Task launch for StepId=10.0 failed on node hpc-node-2: Can't find an address, check slurm.conf
srun: error: Application launch failed: Can't find an address, check slurm.conf
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
hpc-node-0
hpc-node-1
srun: error: Timed out waiting for job step to complete

--
Nicolas Greneche
USPN
Support à la recherche / RSSI
https://www-magi.univ-paris13.fr
