Hello,

We have recently enabled topology-aware scheduling on our Slurm cluster. 
One part of the cluster consists of two racks of AMD compute nodes, and Slurm 
now treats these racks as separate entities. Soon, we may add another set of 
AMD nodes whose CPU specs differ slightly from those of the existing nodes. 
We'll aim to balance the new nodes across the two racks for cooling/heating 
reasons. 
The new nodes will be controlled by a new partition.

Does anyone know whether it is possible to treat the two racks as a single 
entity (by connecting the InfiniBand switches together), and so schedule jobs 
across the resources in both racks with no loss of efficiency? I would be 
grateful for your comments and ideas. The alternative is to put all the new 
nodes in a completely new rack, but that would mean purchasing some new 
Ethernet and IB switches. We would rather not, by the way, have node/switch 
connections running across racks.

Best regards,
David
