> I haven't thought about it too hard, but the default NHC scripts do
> not notice that.
That's the problem with NHC and any other problem-checking script: you
have to tell it what errors to check for. As new errors occur, those
scripts inevitably grow longer.
--
Prentice
On 5/4/21 12:47 PM, Alex Chekholko wrote:
In my most recent experience, I have some SSDs in compute nodes that
occasionally just drop off the bus, so the compute node loses its OS
disk. I haven't thought about it too hard, but the default NHC
scripts do not notice that. Similarly, Paul's proposed script might
need to also check that the slurm log file is readable.
The way I detect it myself is that a random swath of jobs fails, and
then when I SSH to the node I get an I/O error instead of a regular
connection.
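A minimal sketch of that extra readability check in Python (the slurmd
log path and the drain reason below are just illustrative, adjust for
your site): it tries to read the log and drains the node via scontrol
if the read throws an I/O error.

#!/usr/bin/env python3
# Minimal sketch: verify the slurmd log on the local disk is still
# readable and drain the node if it is not. The log path and the
# drain reason are assumptions for illustration.
import socket
import subprocess

SLURMD_LOG = "/var/log/slurmd.log"  # adjust to your site's log location

def log_is_readable(path):
    try:
        with open(path, "rb") as f:
            f.read(4096)  # a disk that dropped off the bus raises OSError here
        return True
    except OSError:
        return False

if __name__ == "__main__":
    if not log_is_readable(SLURMD_LOG):
        node = socket.gethostname().split(".")[0]
        subprocess.run(
            ["scontrol", "update", f"nodename={node}", "state=drain",
             "reason=slurmd log unreadable (OS disk error?)"],
            check=False,
        )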
On Tue, May 4, 2021 at 9:41 AM Paul Edmon <ped...@cfa.harvard.edu> wrote:
Since you can run an arbitrary script as a node health checker, I might
add a script that counts failures and then closes the node if it hits a
threshold. The script shouldn't need to talk to the slurmctld or
slurmdbd, as it should be able to watch the log on the node and see the
failures.
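Something like the rough Python sketch below, with the caveat that the
log path, the failure pattern, the window size, and the threshold are
all placeholders you'd tune to whatever your slurmd log actually prints
for the failures you care about.

#!/usr/bin/env python3
# Rough sketch: count recent failures in the local slurmd log and drain
# the node once a threshold is hit. Log path, failure regex, window size,
# and threshold are assumptions to be tuned per site.
import re
import socket
import subprocess
from collections import deque

SLURMD_LOG = "/var/log/slurmd.log"   # adjust for your site
FAILURE_RE = re.compile(r"error:")   # placeholder for "a job failed" lines
WINDOW = 200                         # how many recent log lines to inspect
THRESHOLD = 10                       # drain after this many matches

def recent_failures(path, window):
    with open(path, errors="replace") as f:
        tail = deque(f, maxlen=window)  # keep only the last <window> lines
    return sum(1 for line in tail if FAILURE_RE.search(line))

if __name__ == "__main__":
    try:
        count = recent_failures(SLURMD_LOG, WINDOW)
    except OSError:
        count = THRESHOLD  # can't even read the log: treat as unhealthy
    if count >= THRESHOLD:
        node = socket.gethostname().split(".")[0]
        subprocess.run(
            ["scontrol", "update", f"nodename={node}", "state=drain",
             f"reason=health check: {count} recent failures in slurmd log"],
            check=False,
        )

Pointing HealthCheckProgram in slurm.conf at a script like this (slurmd
runs it on every node at HealthCheckInterval) keeps the detection local
to the node; it could also be wired in as a custom NHC check.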
-Paul Edmon-
On 5/4/2021 12:09 PM, Gerhard Strangar wrote:
> Hello,
>
> how do you implement something like "drain host after 10 consecutive
> failed jobs"? Unlike a host check script, which checks for known
> errors, I'd like to stop killing jobs just because one node is faulty.
>
> Gerhard
>