Hi,

Right now we use PANIC for very different kinds of errors.
Sometimes for errors that are persistent, where crash-restarting and trying again won't help:

    ereport(PANIC,
            (errmsg("could not locate a valid checkpoint record")));

or

    ereport(PANIC,
            (errmsg("online backup was canceled, recovery cannot continue")));

Sometimes for errors that could be transient, e.g. when running out of space while trying to write out WAL:

    ereport(ERROR,
            (errcode_for_file_access(),
             errmsg("could not write to file \"%s\": %m", tmppath)));

(the ERROR is often promoted to a PANIC due to critical sections)

or

    ereport(PANIC,
            (errcode_for_file_access(),
             errmsg("could not write to log file \"%s\" at offset %u, length %zu: %m",
                    xlogfname, startoffset, nleft)));

Sometimes for "should never happen" checks that are important enough to check in production builds:

    elog(PANIC, "stuck spinlock detected at %s, %s:%d",
         func, file, line);

or

    elog(PANIC, "failed to re-find shared proclock object");

I have two main issues with this:

1) If core dumps are allowed, we trigger core dumps for all of these. That's good for "should never happen" errors like a stuck spinlock. But it makes no sense for things like the on-disk state being wrong at startup, or running out of space while writing WAL - if anything, it might make things worse! It's very useful to be able to collect dumps for crashes in production, but it's not useful to generate thousands of identical cores because crash recovery fails with out-of-space and we retry over and over.

2) For errors where crash-restarting won't fix anything, using PANIC doesn't allow postmaster to distinguish between an error that should lead postmaster to exit itself (after killing the other processes, obviously) and the normal crash-restart cycle.

I've been trying to do some fleet-wide analyses of the causes of crashes, but having core dumps for lots of things that aren't crashes, often repeated many times, makes that much harder. Filtering out abort()s and looking only at segfaults filters out far too much.

I don't quite know what we should do, but the current situation decidedly doesn't seem great. Maybe we could have (a rough sketch is attached as a PS below):

- PANIC_BUG: triggers abort(), followed by a crash-restart cycle; to be used for things like a stuck spinlock

- PANIC_RETRY: causes a crash-restart cycle, without a core dump; to be used for things like ENOSPC during WAL writes

- PANIC_EXIT: causes postmaster to exit(1); to be used for things where retrying won't help, like "requested recovery stop point is before consistent recovery point"

One could argue that some of the PANICs that just want to shut down the server should instead be FATALs, with knowledge in postmaster about which/when such errors should trigger exiting. We do have something like this for the startup process, but only for errors that happen "early enough", and without being able to distinguish between "retryable" and "should exit" errors. But ISTM that that requires adding more and more knowledge to postmaster.c, instead of leaving it with the code that raises the error.

Greetings,

Andres Freund
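
PS: To make the proposal above a bit more concrete, here's a rough sketch of how the backend side could behave. None of this is actual code - the PanicKind enum, the panic_exit() helper and the exit-status values are made up for illustration. In a real patch this would presumably be folded into errfinish() in elog.c, with postmaster.c mapping the statuses in its child-exit handling.

    #include <stdlib.h>
    #include <unistd.h>

    typedef enum PanicKind
    {
        PANIC_BUG,      /* should never happen, e.g. stuck spinlock */
        PANIC_RETRY,    /* possibly transient, e.g. ENOSPC writing WAL */
        PANIC_EXIT      /* retrying won't help, e.g. bad on-disk state */
    } PanicKind;

    /* Called after the error has been logged; never returns. */
    static void
    panic_exit(PanicKind kind)
    {
        switch (kind)
        {
            case PANIC_BUG:
                /* dump core; postmaster runs the usual crash-restart cycle */
                abort();
            case PANIC_RETRY:
                /*
                 * No core dump. Any exit status other than 0 or 1 already
                 * makes postmaster treat the child as crashed, so the
                 * normal restart cycle still happens.
                 */
                _exit(2);
            case PANIC_EXIT:
                /*
                 * A status postmaster would newly learn to interpret as
                 * "kill the remaining children, then exit(1) yourself".
                 */
                _exit(3);
        }
        abort();                /* not reached, keeps the compiler quiet */
    }

The exact status values obviously don't matter. The point is that the process raising the error picks the behavior, and postmaster only has to map a small, fixed set of statuses to "restart" vs "shut down", instead of accumulating knowledge about individual errors.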