Hi,

Right now we use PANIC for very different kinds of errors.
Sometimes for errors that are persistent, where crash-restarting and trying again won't help:

    ereport(PANIC,
            (errmsg("could not locate a valid checkpoint record")));

or

    ereport(PANIC,
            (errmsg("online backup was canceled, recovery cannot continue")));

Sometimes for errors that could be transient, e.g. when running out of space while trying to write out WAL:

    ereport(ERROR,
            (errcode_for_file_access(),
             errmsg("could not write to file \"%s\": %m", tmppath)));

(the ERROR is often promoted to a PANIC due to critical sections)

or

    ereport(PANIC,
            (errcode_for_file_access(),
             errmsg("could not write to log file \"%s\" at offset %u, length %zu: %m",
                    xlogfname, startoffset, nleft)));

Sometimes for "should never happen" checks that are important enough to check in production builds:

    elog(PANIC, "stuck spinlock detected at %s, %s:%d",
         func, file, line);

or

    elog(PANIC, "failed to re-find shared proclock object");

I have two main issues with this:

1) If core dumps are allowed, we trigger core dumps for all of these. That's good for "should never happen" errors like a stuck spinlock. But it makes no sense for things like the on-disk state being wrong at startup, or running out of space while writing WAL - if anything, it might make things worse! It's very useful to be able to collect dumps for crashes in production, but it's not useful to generate thousands of identical cores because crash recovery fails with out-of-space and we retry over and over.

2) For errors where crash-restarting won't fix anything, using PANIC doesn't allow postmaster to distinguish between an error that should lead postmaster to exit itself (after killing the other processes, obviously) and the normal crash-restart cycle.

I've been trying to do some fleet-wide analyses of the causes of crashes, but having core dumps for lots of things that aren't crashes, often repeated many times, makes that much harder. Filtering out abort()s and looking only at segfaults filters out far too much.

I don't quite know what we should do, but the current situation decidedly doesn't seem great. Maybe we could have (a rough sketch is attached as a PS below):

- PANIC_BUG: triggers abort(), followed by a crash-restart cycle; to be used for things like a stuck spinlock

- PANIC_RETRY: causes a crash-restart cycle, without a core dump; to be used for things like ENOSPC during WAL writes

- PANIC_EXIT: causes postmaster to exit(1); to be used for things where retrying won't help, like "requested recovery stop point is before consistent recovery point"

One could argue that some of the PANICs that just want to shut down the server should instead be FATALs, with knowledge in postmaster about which/when such errors should trigger exiting. We do have something like this for the startup process, but only for errors that happen "early enough", and without being able to distinguish between "retryable" and "should exit" errors. But ISTM that that requires adding more and more knowledge to postmaster.c, instead of leaving it with the code that raises the error.

Greetings,

Andres Freund
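
PS: To make the proposal above a bit more concrete, here's a rough sketch of how the backend side could behave. None of this is actual code - the PanicKind enum, the panic_exit() helper and the exit-status values are made up for illustration. In a real patch this would presumably be folded into errfinish() in elog.c, with postmaster.c mapping the statuses in its child-exit handling.

    #include <stdlib.h>
    #include <unistd.h>

    typedef enum PanicKind
    {
        PANIC_BUG,      /* should never happen, e.g. stuck spinlock */
        PANIC_RETRY,    /* possibly transient, e.g. ENOSPC writing WAL */
        PANIC_EXIT      /* retrying won't help, e.g. bad on-disk state */
    } PanicKind;

    /* Called after the error has been logged; never returns. */
    static void
    panic_exit(PanicKind kind)
    {
        switch (kind)
        {
            case PANIC_BUG:
                /* dump core; postmaster runs the usual crash-restart cycle */
                abort();
            case PANIC_RETRY:
                /*
                 * No core dump. Any exit status other than 0 or 1 already
                 * makes postmaster treat the child as crashed, so the
                 * normal restart cycle still happens.
                 */
                _exit(2);
            case PANIC_EXIT:
                /*
                 * A status postmaster would newly learn to interpret as
                 * "kill the remaining children, then exit(1) yourself".
                 */
                _exit(3);
        }
        abort();                /* not reached, keeps the compiler quiet */
    }

The exact status values obviously don't matter. The point is that the process raising the error picks the behavior, and postmaster only has to map a small, fixed set of statuses to "restart" vs "shut down", instead of accumulating knowledge about individual errors.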