We built our stack using helmod, which is an extension of Lmod that uses RPM
spec files. Our spec for openmpi can be found here:
https://github.com/fasrc/helmod/blob/master/rpmbuild/SPECS/rocky8/openmpi-5.0.2-fasrc01.spec
I've tested with both Intel and GCC and have seen no issues (we use ReFrame
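For anyone wanting a quick sanity check of a build like this, a minimal smoke test along these lines is what I'd suggest; the module name is only a guess based on the spec file name, and mpi_hello.c stands for any trivial MPI_Init/print/MPI_Finalize program:

  module load openmpi/5.0.2-fasrc01      # hypothetical module name derived from the spec
  mpicc -o mpi_hello mpi_hello.c         # the wrapper uses whichever compiler the module was built against
  srun --mpi=pmix -N 2 -n 2 ./mpi_hello  # assumes Slurm's pmix plugin; OpenMPI 5.x needs PMIx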
On 26-08-2024 20:30, Paul Edmon via slurm-users wrote:
I haven't seen any behavior like that. For reference we are running
Rocky 8.9 with MOFED 23.10.2
That's interesting! Our nodes run Rocky 8.10, and we have installed the
Mellanox driver tarball
MLNX_OFED_LINUX-24.04-0.7.0.0-rhel8.9-x86_64.t
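(For comparing setups, the quickest way I know to report both pieces of information, assuming ofed_info is on the PATH from the MOFED install, is:

  cat /etc/redhat-release   # OS release, e.g. Rocky Linux 8.10
  ofed_info -s              # installed MOFED version string
)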
I haven't seen any behavior like that. For reference we are running
Rocky 8.9 with MOFED 23.10.2
-Paul Edmon-
On 8/26/2024 2:23 PM, Ole Holm Nielsen via slurm-users wrote:
Hi Paul,
On 26-08-2024 15:29, Paul Edmon via slurm-users wrote:
We've had this exact hardware for years now (all the CPU trays from
Lenovo have been dual trays for the past few generations, though
previously they used a Y cable for connecting both). Basically the way
we handle it is to drain its
Ciao Fabio,
That is for sure syntactically incorrect, because of the way sbatch parsing
works: as soon as it finds a non-empty, non-comment line (your first srun), it
stops parsing for #SBATCH directives. So, assuming this is a single file
as it looks from the formatting, the second hetjob and the cl
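To illustrate the parsing point, a minimal hetjob batch script keeps every #SBATCH line, including the hetjob separator, above the first srun; the resource numbers and program names below are only placeholders:

  #!/bin/bash
  #SBATCH --ntasks=1 --cpus-per-task=4    # component 0
  #SBATCH hetjob
  #SBATCH --ntasks=8 --cpus-per-task=2    # component 1
  # The first non-empty, non-comment line ends #SBATCH parsing, so all
  # directives (and every hetjob separator) must appear above this point.
  srun --het-group=0 ./app0 &
  srun --het-group=1 ./app1 &
  wait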
Hi everyone, for accounting reasons I need to create a single job that spans two
or more federated clusters, with two or more srun steps.
I'm trying hetjobs, but it's not clear to me from the documentation
(https://slurm.schedmd.com/heterogeneous_jobs.html) whether this is possible and how
to do it.
We've had this exact hardware for years now (all the CPU trays from
Lenovo have been dual trays for the past few generations, though
previously they used a Y cable for connecting both). Basically the way
we handle it is to drain its partner node whenever one goes down for a
hardware issue.
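For the record, the draining itself is just scontrol; the node names below are made up:

  # Drain the healthy partner alongside the failed node, with a reason we can grep for later
  scontrol update NodeName=sd665-a012 State=DRAIN Reason="partner node down for hardware service"
  # Once the tray has been serviced, return both halves
  scontrol update NodeName=sd665-a011,sd665-a012 State=RESUME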
We're experimenting with ways to manage our new racks of Lenovo SD665 V3
dual-server trays with Direct Water Cooling (further information is on our
Wiki page https://wiki.fysik.dtu.dk/ITwiki/Lenovo_SD665_V3/ ).
Management problems arise because the two servers in a tray share common power
and water
Hello,
Thanks again for your documentation; I deployed 24.05.2 last week.
But this weekend slurmctld crashed, with only the following in the logs:
"Aug 25 15:33:02 slurmadmin slurmctld[79950]: free(): invalid next size (fast)"
Also, I regularly get these messages in my logs even though these two