That's a very interesting design. Looking at the SD665 V3 documentation, am I correct that each node has dual 25 Gb/s SFP28 interfaces?
If so, then despite the dual nodes in a 1U configuration, you actually have 2 separate servers?

Sid

On Fri, 23 Feb 2024, 22:40 Ole Holm Nielsen via slurm-users <slurm-users@lists.schedmd.com> wrote:

> We're in the process of installing some racks with Lenovo SD665 V3 [1]
> water-cooled servers. A Lenovo DW612S chassis contains six 1U trays with 2
> SD665 V3 servers mounted side-by-side in each tray.
>
> Lenovo delivers SD665 V3 servers including water-cooled NVIDIA InfiniBand
> "SharedIO" adapters [2] so that one node is the Primary including a PCIe
> adapter, and the other is Auxiliary with just a cable to the Primary's
> adapter.
>
> Obviously, servicing 2 "Siamese twin" Slurm nodes requires a bit of care
> and planning. What is worse is that when the Primary node is rebooted or
> powered down, the Auxiliary node will lose its InfiniBand connection and
> may have a PCIe fault or an NMI as documented in [3]. And when nodes are
> powered up, the Primary must have completed POST before the Auxiliary gets
> started. I wonder how to best deal with power failures?
>
> It seems that when Slurm jobs are running on Auxiliary nodes, these jobs
> are going to crash when the possibly unrelated Primary node goes down.
>
> This looks like a pretty bad system design on the part of Lenovo :-( The
> goal was apparently to save some money on IB adapters and to have fewer IB
> cables.
>
> Question: Do any Slurm sites out there already have experiences with
> Lenovo "Siamese twin" nodes with SharedIO IB? Have you developed some
> operational strategies, for example dealing with node pairs as a single
> entity for job scheduling?
>
> Thanks for sharing any ideas and insights!
>
> Ole
>
> [1] https://lenovopress.lenovo.com/lp1612-lenovo-thinksystem-sd665-v3-server
> [2] https://lenovopress.lenovo.com/lp1693-thinksystem-nvidia-connectx-7-ndr200-infiniband-qsfp112-adapters
> [3] https://support.lenovo.com/us/en/solutions/ht510888-thinksystem-sd650-and-connectx-6-hdr-sharedio-lenovo-servers-and-storage
>
> --
> Ole Holm Nielsen
> PhD, Senior HPC Officer
> Department of Physics, Technical University of Denmark
>
> --
> slurm-users mailing list -- slurm-users@lists.schedmd.com
> To unsubscribe send an email to slurm-users-le...@lists.schedmd.com