Viktor Dukhovni wrote:
> Could a watchdog timer have killed master(8) if it were suspended
> long enough?

Seems plausible.  I could see something in the code timing out, since
things would be blocked waiting for I/O for so long.

> Demi Marie Obenour:
> > My intuition is that either some timeout somewhere got hit, or that
> > some I/O failed (rather than being queued forever) and caused an error
> > paging in some code.  That would cause Postfix to die with SIGBUS.
>
> If the file system was unavailable, then yes, failure to page in
> some code would be fatal.

This is a good brainstorm.  I wasn't thinking about the swap side of
memory.  It seems very plausible to me that a paged-out block might
have been needed, and that the read might have timed out and been
reported as an I/O failure, which would have killed the process.  Or
possibly the reverse: the system may have tried to page out a block,
and the write of that block may have timed out as well.

> > Do you have Postfix set to automatically be restarted if it crashes?

No.  Postfix is very reliable and robust.  A restart has never been
needed.  And I think I will resist the urge to add automated
restarting of Postfix now too, because this was a very unusual
situation.  I know we always fight the last war.  I doubt this will be
a repeating problem, and an automatic restart would add a layer of
snag that another admin might not be expecting.  Plus I have now
learned that if the network is offline for any significant time then
all affected systems should be rebooted as a precaution.  And a reboot
is always okay.  Systems reboot just fine.

Instead I think I will add a watchdog of some sort that would
automatically detect this type of network attached storage outage and
then automatically reboot the system if it detects that it is
recovering from such a state.  That's harder to do.  But it solves the
problem for the entire system globally.

> I expect that the restart would fail for the same reason as you
> describe above.

I would expect that it would block waiting for I/O and simply wait to
start.
It would stack up as another process that increases the load average.
Then eventually, when the disk request was serviced, it would continue
and start.

Thank you everyone for brainstorming along with me.  It's a good
learning experience.  And I think I now know that I need a way to
detect that the network attached block storage has been offline too
long, and that the system needs to be rebooted once it has recovered
from that.

Bob
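
A minimal sketch of the detection idea described above, not a tested
watchdog.  It assumes that a small synchronous write to the network
attached storage blocks for as long as the storage is stalled, so the
time the write takes measures the outage.  The names probe_seconds,
needs_reboot, STALL_LIMIT_SECONDS and the probe file name are all
hypothetical, and the 300 second limit is only a placeholder.

```python
import os
import time

# Assumption: a stall longer than this means the system should be rebooted.
STALL_LIMIT_SECONDS = 300.0


def probe_seconds(directory):
    """Force one small write to storage and return how long it took.

    If the backing storage is stalled, os.fsync() blocks until the
    request is finally serviced, so the elapsed time covers the stall.
    An I/O error is reported as an infinitely long stall.
    """
    path = os.path.join(directory, ".storage-probe")  # hypothetical name
    start = time.monotonic()
    try:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
        try:
            os.write(fd, b"probe\n")
            os.fsync(fd)  # push the write past the page cache
        finally:
            os.close(fd)
    except OSError:
        return float("inf")
    return time.monotonic() - start


def needs_reboot(elapsed, limit=STALL_LIMIT_SECONDS):
    """True when a probe took longer than the limit or failed outright."""
    return elapsed > limit
```

Something like this could be run in a loop against a directory on the
affected storage, handing off to the site's usual reboot mechanism when
needs_reboot() returns True.  The watchdog would itself need to stay
resident in memory, since it could otherwise be caught by the same
page-in failure discussed above.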