Viktor Dukhovni wrote:
> Could a watchdog timer have killed master(8) if it were suspended
> long enough?

Seems plausible.  I could see something in the code timing out since
things would be blocked waiting for I/O for so long.

> Demi Marie Obenour:
> > My intuition is that either some timeout somewhere got hit, or that
> > some I/O failed (rather than being queued forever) and caused an error
> > paging in some code.  That would cause Postfix to die with SIGBUS.
> 
> If the file system was unavailable, then yes, failure to page in
> some code would be fatal.

This is a good brainstorm.  I wasn't thinking about the swap side of
memory.  It seems very plausible to me that a paged out block might
have been needed.  And that might have timed out and been reported as
an I/O failure.  Which would have killed the process.  Or possibly
the reverse.  The system may have tried to page out a block and the
writing of that block may have timed out as well.

> > Do you have Postfix set to automatically be restarted if it crashes?

No.  Postfix is very reliable and robust.  It has never been needed.

And I think I will resist the urge to add automated restarting of
postfix now too.  Because this was a very unusual situation.  I know
we always fight the last war.  I doubt this will be a repeating
problem.  But it would add a layer of snag that another admin might
not be expecting.

Plus I have now learned that if the network is offline for any
significant time then all affected systems should be rebooted as a
precaution.  And a reboot is always okay.  Systems reboot just
fine.

Instead I think I will add a watchdog of some sort that would
automatically detect this type of network attached storage outage and
then automatically reboot the system if it detects that it is
recovering from such a state.  That's harder to do.  But it solves the
problem for the entire system globally.
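
As a rough illustration of that watchdog idea, here is a minimal
sketch in Python.  It is only one possible approach, not a tested
tool.  The probe path, the max_outage threshold, the interval, and the
on_recovered callback are all hypothetical names chosen for the
example.  It assumes that a small write followed by fsync on the
network attached storage either blocks or fails with an I/O error
while the storage is offline, and it measures how long that condition
lasted once a probe finally succeeds.

```python
import os
import time


def probe(path):
    """Write and fsync a small file; blocks or raises OSError while storage is down."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(fd, b"probe\n")
        os.fsync(fd)
    finally:
        os.close(fd)


def watch(path, max_outage, on_recovered, interval=10.0,
          probe=probe, clock=time.monotonic, sleep=time.sleep, rounds=None):
    """Call on_recovered(outage_seconds) once storage returns from a long outage.

    rounds=None loops forever; a number limits the loop (useful for testing).
    """
    outage_start = None
    done = 0
    while rounds is None or done < rounds:
        done += 1
        started = clock()
        try:
            probe(path)
        except OSError:
            # The probe failed outright: remember when trouble began, retry later.
            if outage_start is None:
                outage_start = started
        else:
            # The probe succeeded.  The outage is measured from the first failed
            # probe, or from the start of this probe if it merely blocked.
            begin = outage_start if outage_start is not None else started
            outage = clock() - begin
            outage_start = None
            if outage > max_outage:
                on_recovered(outage)
        sleep(interval)
```

In real use on_recovered might log the outage and then request a
reboot, for example by running "systemctl reboot" on a systemd
system.  One caveat: the watchdog itself could be hit by the same
page-in failure discussed above, so its code and data would probably
need to be locked in memory or kept on local storage.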

> I expect that the restart would fail for the same reason as you
> describe above.

I would expect that it would block waiting for I/O and simply wait to
start.  It would stack up as another process that increases the load
average.  And then eventually when the disk request was serviced then
it would continue and start then.

Thank you everyone for brainstorming along with me.  It's a good
learning experience.  And I think I now know I need a way to detect that
the network attached block storage has been offline too long and that
the system when recovered from that needs to be rebooted.

Bob
