We're experimenting with ways to manage our new racks of Lenovo SD665 V3
dual-server trays with Direct Water Cooling (further information is on our
Wiki page https://wiki.fysik.dtu.dk/ITwiki/Lenovo_SD665_V3/ )
Management problems arise because two servers share a tray with common
power and water cooling. This wouldn't be so bad if it weren't for
Lenovo's NVIDIA/Mellanox SharedIO InfiniBand adapters, where the
left-hand node's IB adapter is a client of the right-hand node's adapter.
As a consequence, we can't reboot or power down the right-hand node
without killing any MPI jobs that happen to be using the left-hand node.
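
One workaround we've been sketching is to always drain both nodes of a
tray together before anyone touches power on the right-hand node. A
minimal sketch (the node names and the odd/even numbering scheme within
a tray are hypothetical; adapt to your own naming):

    #!/usr/bin/env python3
    """Drain both nodes of an SD665 V3 tray together via scontrol.

    Assumes a hypothetical naming scheme where each tray holds two
    consecutively numbered nodes: odd = left-hand (SharedIO client),
    even = right-hand (owns the physical IB adapter).
    """
    import re
    import subprocess
    import sys

    def tray_partner(node):
        """Return the other node in the same tray, e.g. sd001 <-> sd002."""
        prefix, num = re.match(r"([a-z]+)(\d+)$", node).groups()
        n = int(num)
        partner = n + 1 if n % 2 else n - 1
        return f"{prefix}{partner:0{len(num)}d}"

    def drain_tray(node, reason):
        """Drain the node and its SharedIO partner in one scontrol call."""
        pair = f"{node},{tray_partner(node)}"
        subprocess.run(["scontrol", "update", "NodeName=" + pair,
                        "State=DRAIN", "Reason=" + reason], check=True)

    if __name__ == "__main__":
        drain_tray(sys.argv[1], "SD665 V3 tray maintenance")

Jobs on either node of course still have to finish (or be requeued)
before the drain completes, but at least nobody can power off the
right-hand node under a running left-hand job by accident.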
My question is whether other Slurm sites running Lenovo dual-server trays
with SharedIO InfiniBand adapters have developed some clever ways of
handling such node pairs as a single entity. Is there anything we should
configure on the Slurm side to make such nodes easier to manage?
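
One idea we've been considering (only a sketch; the feature names below
are made up, and other node parameters are omitted) is to tag both nodes
of each tray with a common feature in slurm.conf, so that tooling can
discover a node's partner instead of hard-coding a naming convention:

    # slurm.conf excerpt -- hypothetical node/feature names:
    NodeName=sd[001-002] Features=tray01
    NodeName=sd[003-004] Features=tray02

A drain script could then read the tray feature from "scontrol show
node" and drain everything that shares it. But perhaps there's a more
idiomatic Slurm mechanism for this?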
Thanks for sharing any insights,
Ole
--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark
--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com