Hi Luke,
Thanks very much for your feedback about the Lenovo SD650 V1 water-cooled
servers. The new SD665 V3 also consists of 2 AMD Genoa servers in a
shared tray. I have now installed Rocky Linux 8.9 on the nodes and tested
the InfiniBand connectivity.
Fortunately, Lenovo/Mellanox/NVIDIA seem to have fixed the InfiniBand
"SharedIO" adapters in the latest generation (V3?) of servers, so that the
Primary node can be rebooted or even powered off *without* causing any
InfiniBand glitches on the Auxiliary (left-hand) node. This is a great
relief to me :-)
What I did was run a continuous ping over the IPoIB interface from the
Auxiliary (left-hand) node to another node in the cluster, while rebooting
or powering down the Primary node. Not a single IP packet was lost.
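For reference, the test was roughly this (hostnames and BMC credentials are
placeholders, not our real names):

# On the Auxiliary (left-hand) node, ping another node over IPoIB:
ping -i 0.2 <other-node-ipoib>
# Meanwhile, reboot or power off the Primary node, e.g. from its BMC:
ipmitool -I lanplus -H <primary-node-bmc> -U <user> -P <password> power cycle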
Best regards,
Ole
On 3/1/24 18:25, Luke Sudbery wrote:
We have these cards in some SD650 V1 servers.
You get 2 nodes in a 1U configuration, but they are attached, so you can only
pull both out of the rack at once.
Ours are slightly older, so we only have 1x 1GbE on-board per server, plus
1x 200Gb HDR port on the B server, which provides a "virtual" 200G 4x HDR
port on each node, although I think in practice they function as a 2x HDR
100G port on each server. I think the SD665 V3 will have 2x 25G SFP28 NICs
on each node, plus the 1 or 2 200G NDR QSFP112 ports provided by the
ConnectX-7 card on the B server, shared with the A server.
You can totally reboot either server without affecting the other. You will
just see something like:
[ 356.799171] mlx5_core 0000:58:00.0: mlx5_fw_tracer_start:821:(pid 819):
FWTracer: Ownership granted and active
As the “owner” fails over from one node to the other.
However, doing a full power off of the B server will crash the A node:
bear-pg0206u28a: 02/24/2024 15:36:57 OS Stop, Run-time critical Stop
(panic/BSOD) (Sensor 0x46)
bear-pg0206u28a: 02/24/2024 15:36:58 Critical Interrupt, Bus Uncorrectable
Error (PCIs)
bear-pg0206u28a: 02/24/2024 15:37:01 Slot / Connector, Fault Status
asserted PCIe 1 (PCIe 1)
bear-pg0206u28a: 02/24/2024 15:37:03 Critical Interrupt, Software NMI (NMI
State)
bear-pg0206u28a: 02/24/2024 15:37:07 Module / Board, State Asserted
(SharedIO fail)
And this will put the fault light on. Once the B node is back, the A node
will recover OK, but you need to do a virtual reseat or just restart the BMC
to clear the fault.
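For example, restarting the BMC out-of-band looks something like this (the
BMC address and credentials are placeholders):

# Cold-restart the A node's BMC/XCC to clear the SharedIO fault indication:
ipmitool -I lanplus -H <a-node-bmc> -U <user> -P <password> mc reset cold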
So in day-to-day usage we don't generally notice. It can be a bit of a pain
during outages or reinstalls: obviously, updating firmware on the B node
will take out the A node too, as the card needs to be reset. But those are
fairly rare, and we don't do anything special except
rebooting/reinstalling the A nodes after the B nodes are all done, to clear
any errors.
Oh, and the A node doesn’t show up in the IB fabric:
[root@bear-pg0206u28a ~]# ibnetdiscover
ibwarn: [24150] _do_madrpc: send failed; Invalid argument
ibwarn: [24150] mad_rpc: _do_madrpc failed; dport (DR path slid 0; dlid 0; 0)
/var/tmp/OFED_topdir/BUILD/rdma-core-58mlnx43/libibnetdisc/ibnetdisc.c:811;
Failed to resolve self
ibnetdiscover: iberror: failed: discover failed
[root@bear-pg0206u28a ~]#
So we can’t use automated topology generation scripts (without a little
special casing).
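The special casing can be as simple as skipping nodes where discovery fails,
roughly like this sketch (generate_topology is a stand-in for whatever script
builds your topology file):

# Only harvest the fabric from nodes whose HCA owns the physical port;
# on SharedIO A nodes ibnetdiscover fails as shown above, so skip them.
if ibnetdiscover > /tmp/ibnetdiscover.out 2>/dev/null; then
    generate_topology /tmp/ibnetdiscover.out
else
    echo "SharedIO auxiliary node, skipping fabric discovery" >&2
fi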
Cheers,
Luke
--
Luke Sudbery
Principal Engineer (HPC and Storage).
Architecture, Infrastructure and Systems
Advanced Research Computing, IT Services
Room 132, Computer Centre G5, Elms Road
*Please note I don’t work on Monday.*
*From:* Sid Young via slurm-users <slurm-users@lists.schedmd.com>
*Sent:* Friday, February 23, 2024 9:49 PM
*To:* Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk>
*Cc:* Slurm User Community List <slurm-users@lists.schedmd.com>
*Subject:* [slurm-users] Re: Slurm management of dual-node server trays?
That's a very interesting design. Looking at the SD665 V3 documentation,
am I correct that each node has dual 25 Gb/s SFP28 interfaces?
If so, then despite the dual nodes in a 1U configuration, you actually have 2
separate servers?
Sid
On Fri, 23 Feb 2024, 22:40 Ole Holm Nielsen via slurm-users,
<slurm-users@lists.schedmd.com> wrote:
We're in the process of installing some racks with Lenovo SD665 V3 [1]
water-cooled servers. A Lenovo DW612S chassis contains 6 1U trays with 2
SD665 V3 servers mounted side-by-side in each tray.
Lenovo delivers SD665 V3 servers with water-cooled NVIDIA InfiniBand
"SharedIO" adapters [2], so that one node is the Primary, which houses the
PCIe adapter, and the other is the Auxiliary, with just a cable to the
Primary's adapter.
Obviously, servicing 2 "Siamese twin" Slurm nodes requires a bit of care
and planning. What is worse is that when the Primary node is rebooted or
powered down, the Auxiliary node will lose its InfiniBand connection and
may get a PCIe fault or an NMI, as documented in [3]. And when nodes are
powered up, the Primary must have completed POST before the Auxiliary gets
started. I wonder how to best deal with power failures?
It seems that when Slurm jobs are running on Auxiliary nodes, these jobs
are going to crash when the possibly unrelated Primary node goes down.
This looks like a pretty bad system design on the part of Lenovo :-( The
goal was apparently to save some money on IB adapters and to have fewer IB
cables.
Question: Do any Slurm sites out there already have experience with
Lenovo "Siamese twin" nodes with SharedIO IB? Have you developed some
operational strategies, for example dealing with node pairs as a single
entity for job scheduling?
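One rough idea I have (just a sketch, with made-up node names) is to treat
the pair as a single unit when draining for service:

# Before servicing the Primary, drain its Auxiliary twin as well, since the
# Auxiliary loses InfiniBand when the Primary goes down:
scontrol update NodeName=sd665p01 State=DRAIN Reason="Service on SharedIO Primary"
scontrol update NodeName=sd665a01 State=DRAIN Reason="SharedIO Primary sd665p01 down for service"

That way the Auxiliary node would stop accepting jobs before its Primary
twin is taken down.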
Thanks for sharing any ideas and insights!
Ole
[1] https://lenovopress.lenovo.com/lp1612-lenovo-thinksystem-sd665-v3-server
[2] https://lenovopress.lenovo.com/lp1693-thinksystem-nvidia-connectx-7-ndr200-infiniband-qsfp112-adapters
[3] https://support.lenovo.com/us/en/solutions/ht510888-thinksystem-sd650-and-connectx-6-hdr-sharedio-lenovo-servers-and-storage
--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark
--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com