[slurm-users] Re: Slurm management of Lenovo SD665 V3 dual-server trays?

2024-08-26 Thread Paul Edmon via slurm-users
We built our stack using helmod, an extension of Lmod that uses RPM spec files. Our spec for OpenMPI can be found here: https://github.com/fasrc/helmod/blob/master/rpmbuild/SPECS/rocky8/openmpi-5.0.2-fasrc01.spec I've tested with both Intel and GCC and have seen no issues (we use ReFrame
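For readers unfamiliar with that repository, a minimal sketch of building the linked spec with plain rpmbuild follows; helmod's own tooling may wrap this differently, and the tarball placement step is an assumption about the spec's Source0:

    git clone https://github.com/fasrc/helmod.git
    cd helmod
    # place the OpenMPI source tarball where the spec's Source0 expects it,
    # then build the binary and source RPMs from the repo's own tree:
    rpmbuild --define "_topdir $PWD/rpmbuild" \
        -ba rpmbuild/SPECS/rocky8/openmpi-5.0.2-fasrc01.spec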

[slurm-users] Re: Slurm management of Lenovo SD665 V3 dual-server trays?

2024-08-26 Thread Ole Holm Nielsen via slurm-users
On 26-08-2024 20:30, Paul Edmon via slurm-users wrote: I haven't seen any behavior like that. For reference we are running Rocky 8.9 with MOFED 23.10.2. That's interesting! Our nodes run Rocky 8.10 and have the Mellanox driver installed from the tar-ball MLNX_OFED_LINUX-24.04-0.7.0.0-rhel8.9-x86_64.t
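When comparing setups like this, a quick way to capture the exact driver and OS versions on a node (ofed_info ships with the Mellanox driver install):

    ofed_info -s                       # prints the MLNX_OFED_LINUX version string
    grep PRETTY_NAME /etc/os-release   # e.g. "Rocky Linux 8.10 (Green Obsidian)"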

[slurm-users] Re: Slurm management of Lenovo SD665 V3 dual-server trays?

2024-08-26 Thread Paul Edmon via slurm-users
I haven't seen any behavior like that. For reference we are running Rocky 8.9 with MOFED 23.10.2. -Paul Edmon- On 8/26/2024 2:23 PM, Ole Holm Nielsen via slurm-users wrote: Hi Paul, On 26-08-2024 15:29, Paul Edmon via slurm-users wrote: We've had this exact hardware for years now (all the CPU

[slurm-users] Re: Slurm management of Lenovo SD665 V3 dual-server trays?

2024-08-26 Thread Ole Holm Nielsen via slurm-users
Hi Paul, On 26-08-2024 15:29, Paul Edmon via slurm-users wrote: We've had this exact hardware for years now (all the CPU trays for Lenovo have been dual trays for the past few generations, though previously they used a Y cable for connecting both). Basically the way we handle it is to drain its

[slurm-users] Re: Spread a multistep job across clusters

2024-08-26 Thread Davide DelVento via slurm-users
Ciao Fabio, That is for sure syntactically incorrect, because of the way sbatch parsing works: as soon as it finds a non-empty, non-comment line (your first srun) it stops parsing for #SBATCH directives. So assuming this is a single file, as it looks from the formatting, the second hetjob and the cl
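A minimal sketch of the parsing rule Davide describes (option values are arbitrary placeholders): sbatch reads #SBATCH directives only until the first non-empty, non-comment line, so anything after the first command is an ordinary shell comment.

    #!/bin/bash
    #SBATCH --job-name=demo     # parsed
    #SBATCH --ntasks=4          # parsed

    srun hostname               # first non-comment line: directive parsing stops here

    #SBATCH --mem=8G            # NOT parsed; from here on it is just a comment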

[slurm-users] Spread a multistep job across clusters

2024-08-26 Thread Di Bernardini, Fabio via slurm-users
Hi everyone, for accounting reasons I need to create only one job across two or more federated clusters, with two or more srun steps. I'm trying hetjobs, but it's not clear to me from the documentation (https://slurm.schedmd.com/heterogeneous_jobs.html) whether this is possible and how to do it.
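For reference, the basic batch-script hetjob syntax from that documentation page looks like the sketch below (component sizes and program names are placeholders); whether the components can land on different clusters of a federation is exactly the open question here.

    #!/bin/bash
    #SBATCH --ntasks=1                 # component 0
    #SBATCH hetjob
    #SBATCH --ntasks=8                 # component 1

    srun --het-group=0 ./pre_process   # step runs on component 0
    srun --het-group=1 ./solver        # step runs on component 1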

[slurm-users] Re: Slurm management of Lenovo SD665 V3 dual-server trays?

2024-08-26 Thread Paul Edmon via slurm-users
We've had this exact hardware for years now (all the CPU trays for Lenovo have been dual trays for the past few generations, though previously they used a Y cable for connecting both). Basically the way we handle it is to drain its partner node whenever one goes down for a hardware issue. That
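A sketch of how that drain step could be automated, assuming a hypothetical naming scheme in which node numbers pair odd/even within a tray (adjust the pairing logic to your real layout; zero-padded numbers need extra care to avoid octal arithmetic):

    #!/bin/bash
    # Drain the tray partner of a failed node (hypothetical odd/even pairing).
    node=$1                              # e.g. node123
    num=${node#node}
    if (( num % 2 )); then
        partner="node$((num + 1))"       # odd node pairs with the next even one
    else
        partner="node$((num - 1))"
    fi
    scontrol update NodeName="$partner" State=DRAIN Reason="tray partner $node down"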

[slurm-users] Slurm management of Lenovo SD665 V3 dual-server trays?

2024-08-26 Thread Ole Holm Nielsen via slurm-users
We're experimenting with ways to manage our new racks of Lenovo SD665 V3 dual-server trays with Direct Water Cooling (further information is on our Wiki page https://wiki.fysik.dtu.dk/ITwiki/Lenovo_SD665_V3/ ). Management problems arise because two servers share a tray with common power and water

[slurm-users] Re: Slurmctld process error 'double free or corruption' on RHEL 9 (Rocky Linux)

2024-08-26 Thread William V via slurm-users
Hello, Thanks again for your documentation; I deployed 24.05.2 last week. But this weekend slurmctld crashed with only the following in the logs: "Aug 25 15:33:02 slurmadmin slurmctld[79950]: free(): invalid next size (fast)" Also, I regularly get these messages in my logs even though these two
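To get a usable backtrace out of a crash like that, one approach is to let slurmctld write core dumps and inspect the next one with gdb; this sketch assumes systemd-coredump is installed and handling core_pattern, and that slurm debuginfo packages are available for symbol resolution:

    systemctl edit slurmctld        # add:  [Service]
                                    #       LimitCORE=infinity
    systemctl restart slurmctld
    # after the next crash:
    coredumpctl list slurmctld
    coredumpctl gdb slurmctld       # at the gdb prompt: bt full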