Hello,
I have a question about the cloud feature, or any other feature that could 
solve an issue I have with my cluster. To keep it simple, let's say I have a 
set of 10 nodes. When needed, I move one or more nodes from cluster A to 
cluster B, so in each slurm.conf I define every node that could possibly be 
available:

Cluster A
NodeName=clusterA-[001-010]

Cluster B
NodeName=clusterB-[001-010]

In normal operation I have 5 nodes in 'cluster A' and 5 in 'cluster B', but 
when needed I reboot a node of 'cluster B' into 'cluster A', which results in 
4 nodes in 'cluster B' and 6 in 'cluster A'.
The "issue" is that, since I specify all possible nodes in slurm.conf, when I 
run sinfo what I see is:

Cluster A
Normal up 1-00:00:00 5 up clusterA-[01-05]
Normal up 1-00:00:00 5 down* clusterA-[06-10]
 
Cluster B
Normal up 1-00:00:00 5 up clusterB-[06-10]
Normal up 1-00:00:00 5 down* clusterB-[01-05]

And in both slurmctld.log files I see messages such as:

error: Unable to resolve "clusterA-006": Unknown host

or 

error: Unable to resolve "clusterB-001": Unknown host

Since I have a lot of partitions and a lot of nodes, the sinfo output is much 
harder to read because of the DOWN nodes that are not actually present in the 
system. Is there a way/feature/option to keep sinfo from displaying nodes that 
are NOT present and not reachable by slurmctld, i.e. the ones that produce the 
error 'Unable to resolve "clusterA-006": Unknown host'?

Basically, I'd like to keep all the possible nodes in both slurm.conf files, 
but sinfo should show:

Cluster A
Normal up 1-00:00:00 5 up clusterA-[01-05]

Cluster B
Normal up 1-00:00:00 5 up clusterB-[06-10]

And if I move a node, once that node is actually reachable:

Cluster A
Normal up 1-00:00:00 6 up clusterA-[01-06]

Cluster B
Normal up 1-00:00:00 4 up clusterB-[07-10]
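
To make the filtering I am after concrete, here is a minimal Python sketch that 
post-processes sinfo-style lines. It is only an illustration, not a Slurm 
feature: the function name hide_unreachable is hypothetical, and the column 
layout (state in the fifth whitespace-separated field) is assumed from the 
sample output above. It relies on the trailing "*" that sinfo appends to the 
state of nodes that are not responding.

```python
def hide_unreachable(sinfo_lines):
    """Return the rows whose node state does not end in '*'.

    Assumed layout, taken from the sample output in this message:
    PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
    """
    kept = []
    for line in sinfo_lines:
        fields = line.split()
        # A state such as "down*" marks nodes that are not responding.
        if len(fields) >= 6 and fields[4].endswith("*"):
            continue
        kept.append(line)
    return kept


sample = [
    "Normal up 1-00:00:00 5 up clusterA-[01-05]",
    "Normal up 1-00:00:00 5 down* clusterA-[06-10]",
]

for row in hide_unreachable(sample):
    print(row)
```

A wrapper like this would work, but I would prefer something built into Slurm 
so that every user sees the same view without extra scripts.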

Thanks
Fabio

--
- Fabio Verzelloni - CSCS - Swiss National Supercomputing Centre
via Trevano 131 - 6900 Lugano, Switzerland
Tel: +41 (0)91 610 82 04
 
