We're experimenting with ways to manage our new racks of Lenovo SD665 V3 dual-server trays with Direct Water Cooling (further information is on our Wiki page: https://wiki.fysik.dtu.dk/ITwiki/Lenovo_SD665_V3/).

Management problems arise because two servers share a tray with common power and water cooling. This wouldn't be so bad if it weren't for Lenovo's NVIDIA/Mellanox SharedIO InfiniBand adapters, where the left-hand node's IB adapter is a client of the right-hand node's adapter. As a consequence, we can't reboot or power down the right-hand node without killing any MPI jobs that happen to be running on the left-hand node.
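To illustrate, the manual dance before touching a right-hand node looks roughly like the sketch below (node names and the reason string are just placeholders; Slurm itself has no notion of the pairing):

    #!/bin/bash
    # Sketch: drain both nodes of a SharedIO tray before rebooting the
    # right-hand (primary) node.  Node names are hypothetical.
    set -euo pipefail

    primary="$1"    # right-hand node carrying the physical IB adapter
    partner="$2"    # left-hand node whose IB port is a client of $primary

    # Drain both nodes so no new jobs get scheduled on either of them.
    for node in "$primary" "$partner"; do
        scontrol update NodeName="$node" State=DRAIN \
            Reason="SharedIO tray maintenance"
    done

    # Wait until running jobs have finished and both nodes report 'drained'.
    for node in "$primary" "$partner"; do
        until sinfo -h -N -n "$node" -o '%T' | grep -q '^drained'; do
            echo "Waiting for $node to drain ..."
            sleep 60
        done
    done

    echo "Both nodes drained; safe to power down $primary."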

My question is whether other Slurm sites owning Lenovo dual-server trays with SharedIO InfiniBand adapters have developed some clever ways of handling such node pairs as a single entity? Is there anything we should configure on the Slurm side to make these nodes easier to manage?
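In case it helps the discussion, one idea we have toyed with is encoding the pairing in slurm.conf node features, along these lines (a sketch only; the feature names are our own invention and carry no meaning to Slurm, and hardware attributes are omitted):

    # Tag each SharedIO pair with a common tray feature plus its role:
    NodeName=sd665-a01 Features=tray_a01,sharedio_primary  # right-hand node
    NodeName=sd665-a02 Features=tray_a01,sharedio_client   # left-hand node

Admin scripts could then look up a node's partner with something like "sinfo -h -N -o '%n %f' | grep tray_a01", but that still leaves all the drain-and-wait logic to us.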

Thanks for sharing any insights,
Ole

--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark
