That's a very interesting design. Looking at the SD665 V3 documentation,
am I correct that each node has dual 25 Gb/s SFP28 interfaces?

If so, then despite the dual nodes in a 1U configuration, you actually have 2
separate servers?

Sid

On Fri, 23 Feb 2024, 22:40 Ole Holm Nielsen via slurm-users
<slurm-users@lists.schedmd.com> wrote:

> We're in the process of installing some racks with Lenovo SD665 V3 [1]
> water-cooled servers.  A Lenovo DW612S chassis contains 6 1U trays with 2
> SD665 V3 servers mounted side-by-side in each tray.
>
> Lenovo delivers SD665 V3 servers including water-cooled NVIDIA InfiniBand
> "SharedIO" adapters [2] so that one node is the Primary including a PCIe
> adapter, and the other is Auxiliary with just a cable to the Primary's
> adapter.
>
> Obviously, servicing 2 "Siamese twin" Slurm nodes requires a bit of care
> and planning.  What is worse is that when the Primary node is rebooted or
> powered down, the Auxiliary node will lose its InfiniBand connection and
> may have a PCIe fault or an NMI as documented in [3].  And when nodes are
> powered up, the Primary must have completed POST before the Auxiliary gets
> started.  I wonder how to best deal with power failures?
>
> It seems that when Slurm jobs are running on Auxiliary nodes, these jobs
> are going to crash when the possibly unrelated Primary node goes down.
>
> This looks like a pretty bad system design on the part of Lenovo :-(  The
> goal was apparently to save some money on IB adapters and to have fewer IB
> cables.
>
> Question: Do any Slurm sites out there already have experiences with
> Lenovo "Siamese twin" nodes with SharedIO IB?  Have you developed some
> operational strategies, for example dealing with node pairs as a single
> entity for job scheduling?
>
> Thanks for sharing any ideas and insights!
>
> Ole
>
> [1] https://lenovopress.lenovo.com/lp1612-lenovo-thinksystem-sd665-v3-server
> [2] https://lenovopress.lenovo.com/lp1693-thinksystem-nvidia-connectx-7-ndr200-infiniband-qsfp112-adapters
> [3] https://support.lenovo.com/us/en/solutions/ht510888-thinksystem-sd650-and-connectx-6-hdr-sharedio-lenovo-servers-and-storage
>
> --
> Ole Holm Nielsen
> PhD, Senior HPC Officer
> Department of Physics, Technical University of Denmark
>
> --
> slurm-users mailing list -- slurm-users@lists.schedmd.com
> To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
>