I haven't seen any behavior like that. For reference, we are running
Rocky 8.9 with MOFED 23.10.2.
-Paul Edmon-
On 8/26/2024 2:23 PM, Ole Holm Nielsen via slurm-users wrote:
Hi Paul,
On 26-08-2024 15:29, Paul Edmon via slurm-users wrote:
We've had this exact hardware for years now (all of Lenovo's CPU trays
have been dual-node trays for the past few generations, though
previously they used a Y-cable to connect both). Basically, the way we
handle it is to drain a node's partner whenever one of the pair goes
down for a hardware issue.
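For what it's worth, the drain step amounts to something along these
lines. This is only a sketch: the node-naming scheme and the odd/even
pairing rule are placeholder assumptions for illustration, and scontrol
is simply told to drain the partner so no new jobs land on it.

#!/usr/bin/env python3
"""Drain the partner node of a dual-node tray before servicing.

Sketch only: assumes a hypothetical naming scheme where a tray holds
consecutively numbered nodes (node001/node002, node003/node004, ...).
Adjust pair_of() to match your own naming convention.
"""
import re
import subprocess
import sys


def pair_of(node: str) -> str:
    """Return the partner node sharing the same tray (assumed convention)."""
    m = re.match(r"([a-z]+)(\d+)$", node)
    if not m:
        raise ValueError(f"unexpected node name: {node}")
    prefix, num = m.groups()
    n = int(num)
    partner = n + 1 if n % 2 == 1 else n - 1   # odd node pairs with the next even one
    return f"{prefix}{partner:0{len(num)}d}"


def drain(node: str, reason: str) -> None:
    """Drain a node via scontrol so no new jobs are scheduled on it."""
    subprocess.run(
        ["scontrol", "update", f"NodeName={node}",
         "State=DRAIN", f"Reason={reason}"],
        check=True,
    )


if __name__ == "__main__":
    down_node = sys.argv[1]                     # node going down for service
    partner = pair_of(down_node)
    drain(partner, f"partner of {down_node} down for hardware service")
    print(f"Drained {partner} (partner of {down_node})")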
The SD665 V3 system was announced in Nov. 2022. This V3 generation
seems to come with a single IB cable per two-node tray. In retrospect,
I would have wished for independent IB adapters in each node and an IB
splitter cable (Y-cable) with 200 Gb to 2x100 Gb transceivers.
I agree that we can drain partner nodes in Slurm when servicing a node.
That said, you are free to reboot either node without loss of
connectivity. We do that all the time with no issues. As noted, though,
if you want to actually physically service the nodes, then you have to
take out both.
What we have experienced several times is that multi-node MPI jobs,
running on the left-hand SD665 V3 node plus other nodes in the
cluster, crash when the right-hand node is rebooted for a kernel
update or some other reason. The right-hand node, of course, houses the
physical SharedIO Infiniband adapter.
My interpretation is that the IB adapter gets reset when the
right-hand node reboots, also disrupting IB traffic to the left-hand
node for a while and causing job crashes.
Have you seen any behavior like this?
Thanks,
Ole
On 8/26/2024 8:51 AM, Ole Holm Nielsen via slurm-users wrote:
We're experimenting with ways to manage our new racks of Lenovo
SD665 V3 dual-server trays with Direct Water Cooling (further
information is on our Wiki page
https://wiki.fysik.dtu.dk/ITwiki/Lenovo_SD665_V3/).
Management problems arise because 2 servers share a tray with common
power and water cooling. This wouldn't be so bad if it weren't for
Lenovo's NVIDIA/Mellanox SharedIO Infiniband adapters, where the
left-hand node's IB adapter is a client of the right-hand node's
adapter. So we can't reboot or power down the right-hand node
without killing any MPI jobs that happen to be using the left-hand
node.
My question is whether other Slurm sites owning Lenovo dual-server trays
with SharedIO Infiniband adapters have developed some clever ways of
handling such node pairs as a single entity somehow. Is there anything
we should configure on the Slurm side to make such nodes easier to
manage?
Thanks for sharing any insights,
Ole
--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com