The service is available in RHEL 8 via the EPEL package repository as
systemd-networkd, i.e. systemd-networkd.x86_64 253.4-1.el8epel
-----Original Message-----
From: slurm-users On Behalf Of Ole Holm Nielsen
Sent: Monday, October 30, 2023 1:56 PM
T
Hi Jens,
Thanks for your feedback:
On 30-10-2023 15:52, Jens Elkner wrote:
Actually there is no need for such a script since
/lib/systemd/systemd-networkd-wait-online should be able to handle it.
It seems that systemd-networkd exists in Fedora FC38 Linux, but not in
RHEL 8 and clones, AFAICT.
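For those who do have systemd-networkd (e.g. from EPEL, as noted elsewhere
in this thread), a drop-in could restrict the wait to the IB interface
only. This is a sketch, not a tested recipe; the interface name ib0 is an
assumption:

```ini
# /etc/systemd/system/systemd-networkd-wait-online.service.d/ib.conf
# (hypothetical drop-in; adjust the interface name to your site)
[Service]
ExecStart=
ExecStart=/usr/lib/systemd/systemd-networkd-wait-online --interface=ib0 --timeout=120
```

The empty ExecStart= line clears the packaged command before replacing it,
which is the usual systemd drop-in convention for Type=oneshot services.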
>
> I am working on SLURM 23.11 version.
>
???
Latest version is slurm-23.02.6; which one are you referring to?
https://github.com/SchedMD/slurm/tags
>
If I try to request just nodes and memory, for instance:
#SBATCH -N 2
#SBATCH --mem=0
to request all memory on a node, and 2 nodes seem sufficient for a
program that consumes 100GB, I get this error:
sbatch: error: CPU count per node can not be satisfied
sbatch: error: Batch job submission failed
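One possible workaround (untested here, and the program name is a
placeholder) is to request the memory explicitly instead of asking for all
of a node's memory with --mem=0:

```shell
#!/bin/bash
#SBATCH -N 1              # one node may suffice if the program needs 100 GB total (assumption)
#SBATCH --ntasks=1
#SBATCH --mem=100G        # --mem is memory per node, not per job across nodes
./my_program              # hypothetical program name
```

Note that --mem is a per-node limit, so spreading a 100 GB single-process
job over 2 nodes does not help unless the program is actually distributed.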
On Mon, Oct 30, 2023 at 03:11:32PM +0100, Ole Holm Nielsen wrote:
Hi Max & friends,
...
> Thanks so much for your fast response with a solution! I didn't know that
> NetworkManager (falsely) claims that the network is online as soon as the
> first interface comes up :-(
IIRC it is documented in t
Hello all,
I can't configure the slurm script correctly. My program needs 100GB of
memory; it's the only criterion. But the job always fails with an
out-of-memory error.
Here's the cluster configuration I'm using:
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
partition:
DefMemPerC
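With CR_Core_Memory, memory is a consumable resource tracked alongside
cores, so the default per-CPU memory (DefMemPerCPU) applies unless the job
overrides it. A sketch of an explicit per-CPU request (the CPU count and
split are assumptions, not taken from the poster's setup):

```shell
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4     # assumed CPU count for the job
#SBATCH --mem-per-cpu=25G     # 4 x 25G = 100G total for the job
./my_program                  # hypothetical program name
```

Either --mem (per node) or --mem-per-cpu can be used, but not both in the
same job.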
Hi Max,
Thanks so much for your fast response with a solution! I didn't know that
NetworkManager (falsely) claims that the network is online as soon as the
first interface comes up :-(
Your solution of a wait-for-interfaces Systemd service makes a lot of
sense, and I'm going to try it out.
Hi,
we're not using Omni-Path, but we also had issues with InfiniBand taking
too long to come up, causing slurmd to fail to start.
Our solution was to implement a little wait-for-interface systemd
service which delays the network.target until the ib interface has come up.
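Such a unit might look like the following sketch. This is not the poster's
actual service; the unit name, the interface name ib0, and the 60-second
poll loop are all assumptions:

```ini
# /etc/systemd/system/wait-for-ib0.service (hypothetical name)
[Unit]
Description=Delay network.target until InfiniBand interface ib0 is up
Before=network.target

[Service]
Type=oneshot
# Poll the link state; give up after roughly 60 seconds
ExecStart=/bin/sh -c 'for i in $(seq 60); do ip link show ib0 2>/dev/null | grep -q "state UP" && exit 0; sleep 1; done; exit 1'

[Install]
WantedBy=network.target
```

Because the unit is Before=network.target and wanted by it, anything
ordered after network.target (such as slurmd, via its own dependencies)
will wait for the interface.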
Our discovery was that the
I'm fighting this strange scenario where slurmd is started before the
Infiniband/OPA network is fully up. The Node Health Check (NHC) executed
by slurmd then fails the node (as it should). This happens only on EL8
Linux (AlmaLinux 8.8) nodes, whereas our CentOS 7.9 nodes with
Infiniband/OPA n