We built our stack using helmod, which is an extension of Lmod using RPM
spec files. Our spec for OpenMPI can be found here:
https://github.com/fasrc/helmod/blob/master/rpmbuild/SPECS/rocky8/openmpi-5.0.2-fasrc01.spec
I've tested with both Intel and GCC and have seen no issues (we use
ReFrame for our testing: https://github.com/fasrc/reframe-fasrc).
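For anyone curious what such a check looks like, here is a minimal
sketch of a ReFrame test that compiles and runs an MPI hello-world
across two nodes; the system/environment names and the source file are
placeholders I'm assuming, not anything taken from the fasrc repository.

import reframe as rfm
import reframe.utility.sanity as sn


@rfm.simple_test
class MpiHelloCheck(rfm.RegressionTest):
    # Placeholder selectors: restrict to your cluster and toolchains.
    valid_systems = ['*']
    valid_prog_environs = ['*']
    build_system = 'SingleSource'
    sourcepath = 'hello_mpi.c'   # assumes a standard MPI hello-world source
    num_tasks = 2
    num_tasks_per_node = 1       # force the job to span two nodes

    @sanity_function
    def assert_hello(self):
        # Pass if the ranks printed their greeting.
        return sn.assert_found(r'Hello', self.stdout)

This assumes the programming environment provides an MPI compiler
wrapper; a real site check would pin valid_systems and the toolchains.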
-Paul Edmon-
On 8/26/2024 3:28 PM, Ole Holm Nielsen via slurm-users wrote:
On 26-08-2024 20:30, Paul Edmon via slurm-users wrote:
I haven't seen any behavior like that. For reference, we are running
Rocky 8.9 with MOFED 23.10.2.
That's interesting! Our nodes run Rocky 8.10 with the Mellanox driver
tarball MLNX_OFED_LINUX-24.04-0.7.0.0-rhel8.9-x86_64.tgz installed.
That's close to your setup! User applications may use any MPI package,
but most likely OpenMPI/4.1.5-GCC-12.3.0 from the latest EasyBuild
software modules.
It seems that we need to do some more careful testing of multi-node
MPI jobs while taking SD665 V3 nodes down.
I wonder if there's any additional OpenMPI or Slurm configuration in
your setup, such as building Slurm with --with-pmix?
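One quick way to see whether an installed Slurm has PMIx support is to
list the MPI plugin types that srun knows about. A small sketch,
assuming srun is on PATH (the stream it prints to varies between Slurm
versions, so both stdout and stderr are searched):

import subprocess

result = subprocess.run(['srun', '--mpi=list'],
                        capture_output=True, text=True)
plugins = (result.stdout + result.stderr).lower()
print('PMIx listed by srun:', 'pmix' in plugins)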
Thanks,
Ole
On 8/26/2024 2:23 PM, Ole Holm Nielsen via slurm-users wrote:
Hi Paul,
On 26-08-2024 15:29, Paul Edmon via slurm-users wrote:
We've had this exact hardware for years now (all of Lenovo's CPU trays
have been dual trays for the past few generations, though previously
they used a Y cable to connect both). Basically, the way we handle it
is to drain the partner node whenever one node goes down for a
hardware issue.
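A minimal sketch of how such partner draining could be scripted,
assuming an odd/even hostname pairing scheme; the pairing rule and node
naming are assumptions of mine, not something described in this thread.

import re
import subprocess
import sys


def partner(node: str) -> str:
    """Return the paired node name, assuming trays hold nodes n and n+1."""
    m = re.match(r'([a-z]+)(\d+)$', node)
    if not m:
        raise ValueError(f'unexpected node name: {node}')
    prefix, digits = m.group(1), m.group(2)
    num = int(digits)
    pair = num + 1 if num % 2 == 1 else num - 1   # assumed odd/even pairing
    return f'{prefix}{pair:0{len(digits)}d}'


def drain(node: str, reason: str) -> None:
    """Drain a node via scontrol so no new jobs start on it."""
    subprocess.run(
        ['scontrol', 'update', f'NodeName={node}',
         'State=DRAIN', f'Reason={reason}'],
        check=True,
    )


if __name__ == '__main__':
    target = sys.argv[1]
    drain(partner(target), f'partner of {target} down for hardware service')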
The SD665 V3 system was announced in Nov. 2022. This V3 generation
seems to come with a single IB cable per 2-node tray. In retrospect, I
would have wished for independent IB adapters in each node and an IB
splitter cable (Y-cable) with 200 Gb to 2x100 Gb transceivers.
I agree that we can drain partner nodes in Slurm when servicing a node.
That said, you are free to reboot either node without loss of
connectivity. We do that all the time with no issues. As noted,
though, if you want to actually physically service the nodes, then
you have to take out both.
What we have experienced several times is that multi-node MPI jobs,
running on the left-hand SD665 V3 node plus other nodes in the
cluster, crash when the right-hand node is rebooted for a kernel
update or some other reason. The right-hand node, of course, houses
the physical SharedIO InfiniBand adapter.
My interpretation is that the IB adapter gets reset when the right-
hand node reboots, also disrupting IB traffic to the left-hand node
for a while and causing job crashes.
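One way to test that interpretation would be to poll the local IB port
state on the left-hand node (via sysfs) while the right-hand node
reboots. A small sketch; the device name 'mlx5_0' and port '1' are
assumptions and would need to be adjusted for the actual HCA.

import time
from pathlib import Path

# Assumed device/port; check /sys/class/infiniband/ for the real names.
STATE = Path('/sys/class/infiniband/mlx5_0/ports/1/state')

while True:
    state = STATE.read_text().strip()   # e.g. '4: ACTIVE' or '1: DOWN'
    print(time.strftime('%H:%M:%S'), state, flush=True)
    time.sleep(1)

If the port drops to DOWN during the partner reboot, that would
explain the MPI job crashes.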
Have you seen any behavior like this?
Thanks,
Ole
On 8/26/2024 8:51 AM, Ole Holm Nielsen via slurm-users wrote:
We're experimenting with ways to manage our new racks of Lenovo
SD665 V3 dual-server trays with Direct Water Cooling (further
information is on our Wiki page
https://wiki.fysik.dtu.dk/ITwiki/Lenovo_SD665_V3/ )
Management problems arise because two servers share a tray with
common power and water cooling. This wouldn't be so bad if it
weren't for Lenovo's NVIDIA/Mellanox SharedIO InfiniBand adapters,
where the left-hand node's IB adapter is a client of the
right-hand node's adapter. So we can't reboot or power down the
right-hand node without killing any MPI jobs that happen to be
using the left-hand node.
My question is whether other Slurm sites owning Lenovo dual-server
trays with SharedIO InfiniBand adapters have developed some clever
ways of handling such node pairs as a single entity. Is there
anything we should configure on the Slurm side to make such
nodes easier to manage?
Thanks for sharing any insights,
Ole
--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com