So there are different options you can set for ReturnToService in
slurm.conf which can affect how the node is handled on reconnect. You
can also up the timeouts for the daemons.
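For example, something along these lines in slurm.conf (the values are just
illustrative, see the slurm.conf man page for the details):

    # Let a DOWN node return to service automatically once its slurmd
    # registers again, instead of staying DOWN after a communication failure.
    ReturnToService=2

    # How long slurmctld waits for a slurmd to respond before marking the
    # node DOWN (default 300 seconds); raise it to ride out longer outages.
    SlurmdTimeout=600

    # Time allowed for a round-trip RPC to complete (default 10 seconds).
    MessageTimeout=30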
-Paul Edmon-
On 8/31/2018 5:06 PM, Renfro, Michael wrote:
Hey, folks. I’ve got a Slurm 17.02 cluster (RPMs provided by Bright Computing,
if it matters) with both gigabit Ethernet and Infiniband interfaces. Twice in
the last year, I’ve had a failure inside the stacked Ethernet switches that’s
caused Slurm to lose track of node and job state. Jobs kept running as normal,
since all file traffic is on the Infiniband network.
In both cases, I wasn’t able to cleanly recover. On the first outage, my
attempt at recovery (pretty sure I forcibly drained and resumed the nodes)
caused all active jobs to be killed, and then the next group of queued jobs to
start. On the second outage, all active jobs were restarted from scratch,
including truncating and overwriting any existing output. I think that involved
my restarting slurmd or slurmctld services, but I’m not certain.
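(For reference, the drain/resume would have been something along the lines of:

    scontrol update NodeName=node[001-040] State=DRAIN Reason="switch outage"
    scontrol update NodeName=node[001-040] State=RESUME

though I don't have the exact history any more, and the node list above is just
a placeholder.)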
I’ve built a VM test environment with OpenHPC and Slurm 17.11 to simulate these
kinds of failures, but haven’t been able to reproduce my earlier results. After a
sufficiently long network outage, I get downed nodes with "Reason=Duplicate
jobid".
Basically, I’d like to know what the proper procedure is for recovering from
this kind of outage in the Slurm control network without losing the output from
running jobs. Not sure if I can easily add any redundancy in the Ethernet
network, but I may be able to bring the Infiniband network in for control
traffic if that’s supported. Thanks.