The service is available in RHEL 8 via the EPEL package repository as
systemd-networkd, i.e. systemd-networkd.x86_64 253.4-1.el8epel
-----Original Message-----
From: slurm-users On Behalf Of Ole Holm Nielsen
Sent: Monday, October 30, 2023 1:56 PM
T
Hi Jens,
Thanks for your feedback:
On 30-10-2023 15:52, Jens Elkner wrote:
Actually there is no need for such a script since
/lib/systemd/systemd-networkd-wait-online should be able to handle it.
It seems that systemd-networkd exists in Fedora FC38 Linux, but not in
RHEL 8 and clones, AFAICT.
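For those who do have systemd-networkd (e.g. from EPEL, as noted elsewhere
in this thread), a drop-in could restrict the wait to the IB interface
only. This is a sketch, not a tested recipe; the interface name ib0 is an
assumption:

```ini
# /etc/systemd/system/systemd-networkd-wait-online.service.d/ib.conf
# (hypothetical drop-in; adjust the interface name to your site)
[Service]
ExecStart=
ExecStart=/usr/lib/systemd/systemd-networkd-wait-online --interface=ib0 --timeout=120
```

The empty ExecStart= line clears the packaged command before replacing it,
which is the usual systemd drop-in convention for Type=oneshot services.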
>
> I am working on SLURM 23.11 version.
>
???
Latest version is slurm-23.02.6; which one are you referring to?
https://github.com/SchedMD/slurm/tags
>
If I try to request just nodes and memory, for instance:
#SBATCH -N 2
#SBATCH --mem=0
to request all memory on a node, and 2 nodes seem sufficient for a
program that consumes 100GB, I get this error:
sbatch: error: CPU count per node can not be satisfied
sbatch: error: Batch job submission failed
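One possible workaround (untested here, and the program name is a
placeholder) is to request the memory explicitly instead of asking for all
of a node's memory with --mem=0:

```shell
#!/bin/bash
#SBATCH -N 1              # one node may suffice if the program needs 100 GB total (assumption)
#SBATCH --ntasks=1
#SBATCH --mem=100G        # --mem is memory per node, not per job across nodes
./my_program              # hypothetical program name
```

Note that --mem is a per-node limit, so spreading a 100 GB single-process
job over 2 nodes does not help unless the program is actually distributed.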
On Mon, Oct 30, 2023 at 03:11:32PM +0100, Ole Holm Nielsen wrote:
Hi Max & friends,
...
> Thanks so much for your fast response with a solution! I didn't know that
> NetworkManager (falsely) claims that the network is online as soon as the
> first interface comes up :-(
IIRC it is documented in t
Hello all,
I can't configure the slurm script correctly. My program needs 100GB of
memory; it's the only criterion. But the job always fails with an
out-of-memory error.
Here's the cluster configuration I'm using:
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
partition:
DefMemPerC
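With CR_Core_Memory, memory is a consumable resource tracked alongside
cores, so the default per-CPU memory (DefMemPerCPU) applies unless the job
overrides it. A sketch of an explicit per-CPU request (the CPU count and
split are assumptions, not taken from the poster's setup):

```shell
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4     # assumed CPU count for the job
#SBATCH --mem-per-cpu=25G     # 4 x 25G = 100G total for the job
./my_program                  # hypothetical program name
```

Either --mem (per node) or --mem-per-cpu can be used, but not both in the
same job.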
Hi Max,
Thanks so much for your fast response with a solution! I didn't know that
NetworkManager (falsely) claims that the network is online as soon as the
first interface comes up :-(
Your solution of a wait-for-interfaces Systemd service makes a lot of
sense, and I'm going to try it out.
Hi,
we're not using Omni-Path, but we also had issues with InfiniBand taking
too long to come up, causing slurmd to fail to start.
Our solution was to implement a little wait-for-interface systemd
service which delays the network.target until the ib interface has come up.
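Such a unit might look like the following sketch. This is not the poster's
actual service; the unit name, the interface name ib0, and the 60-second
poll loop are all assumptions:

```ini
# /etc/systemd/system/wait-for-ib0.service (hypothetical name)
[Unit]
Description=Delay network.target until InfiniBand interface ib0 is up
Before=network.target

[Service]
Type=oneshot
# Poll the link state; give up after roughly 60 seconds
ExecStart=/bin/sh -c 'for i in $(seq 60); do ip link show ib0 2>/dev/null | grep -q "state UP" && exit 0; sleep 1; done; exit 1'

[Install]
WantedBy=network.target
```

Because the unit is Before=network.target and wanted by it, anything
ordered after network.target (such as slurmd, via its own dependencies)
will wait for the interface.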
Our discovery was that the
I'm fighting this strange scenario where slurmd is started before the
Infiniband/OPA network is fully up. The Node Health Check (NHC) executed
by slurmd then fails the node (as it should). This happens only on EL8
Linux (AlmaLinux 8.8) nodes, whereas our CentOS 7.9 nodes with
Infiniband/OPA n