Hi Rémi,

Thanks for the feedback!  The patch revert[1] explains SchedMD's rationale:

The reasoning is that sysadmins who see nodes with Reason "Not Responding"
but can manually ping/access the node end up confused. That reason
should only be set if the node is truly not responding, not if the
HealthCheckProgram execution failed or returned a non-zero exit code. In
that case, the program itself would take the appropriate actions, such
as draining the node and setting an appropriate Reason.

We speculate that there may be an issue where slurmd starts up at boot time and begins starting new jobs while NHC is still running in a separate thread, so that NHC fails the node AFTER a job has already started! NHC might fail, for example, if the Infiniband/OPA network or the NVIDIA GPUs have not yet come up completely.

I still need to verify whether this observation is correct and reproducible. Does anyone have evidence that jobs start before NHC is complete when slurmd starts up?
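
One way to test this could be to point HealthCheckProgram at a small
timing wrapper and compare its timestamps with job start times in the
slurmd log. A minimal sketch (the wrapper path and the /usr/sbin/nhc
location are assumptions for illustration):

  #!/bin/bash
  # Hypothetical HealthCheckProgram wrapper: log NHC start/end to syslog
  # so the times can be compared with job starts in slurmd's log.
  logger -t nhc-timing "NHC started"
  /usr/sbin/nhc "$@"
  rc=$?
  logger -t nhc-timing "NHC finished, exit code $rc"
  exit $rc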

IMHO, slurmd ought to start without delay at boot time, then execute NHC and wait for it to complete. Only after NHC has succeeded without errors should slurmd begin accepting new jobs.
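
Until slurmd can do this itself, a rough approximation could be a systemd
drop-in that runs NHC synchronously before slurmd starts. A sketch,
assuming NHC is installed as /usr/sbin/nhc:

  # /etc/systemd/system/slurmd.service.d/nhc.conf
  [Service]
  # Run NHC before slurmd starts; slurmd will not register with slurmctld
  # (and hence not accept jobs) until NHC has exited successfully.
  ExecStartPre=/usr/sbin/nhc

The obvious drawback is that if NHC fails, slurmd never starts at all, and
the node eventually shows up as not responding instead of being drained
with a meaningful Reason, which is exactly the confusion the revert
message describes.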

We should configure NHC to perform site-specific hardware and network checks, for example of the Infiniband/OPA network or the NVIDIA GPUs.
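
As a sketch of what such checks might look like in /etc/nhc/nhc.conf (the
check names and arguments below should be verified against the NHC version
in use):

  # Require an active Infiniband/OPA link at 100 Gb/s on all nodes.
  * || check_hw_ib 100
  # GPU sanity check: fail the node unless nvidia-smi exits 0 within 30 s.
  * || check_cmd_status -t 30 -r 0 nvidia-smi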

Best regards,
Ole

On 11/1/23 09:44, Rémi Palancher wrote:
Hi Ole,

On 30/10/2023 at 13:50, Ole Holm Nielsen wrote:
I'm fighting this strange scenario where slurmd is started before the
Infiniband/OPA network is fully up.  The Node Health Check (NHC) executed
by slurmd then fails the node (as it should).  This happens only on EL8
Linux (AlmaLinux 8.8) nodes, whereas our CentOS 7.9 nodes with
Infiniband/OPA network work without problems.

Question: Does anyone know how to reliably delay the start of the slurmd
Systemd service until the Infiniband/OPA network is fully up?

…

FWIW, after struggling for a while with systemd dependencies to wait for
the availability of networks and shared filesystems, we ended up with a
customer writing a patch for Slurm to delay slurmd registration (and job
starts) until NHC is OK:

https://github.com/scibian/slurm-wlm/blob/scibian/buster/debian/patches/b31fa177c1ca26dcd2d5cd952e692ef87d95b528

For the record, this patch was once merged in Slurm and then reverted[1]
for reasons I did not fully explore.

This approach is far from your original idea; it is clearly not ideal and
should be taken with caution, but it has worked for years for this customer.

[1]
https://github.com/SchedMD/slurm/commit/b31fa177c1ca26dcd2d5cd952e692ef87d95b528


--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark,
Fysikvej Building 309, DK-2800 Kongens Lyngby, Denmark
E-mail: ole.h.niel...@fysik.dtu.dk
Homepage: http://dcwww.fysik.dtu.dk/~ohnielse/
Mobile: (+45) 5180 1620
