On Tue, Nov 26, 2019 at 5:21 PM Justin Pryzby <pry...@telsasoft.com> wrote:
> I looked and found a new "hint".
>
> On Tue, Nov 19, 2019 at 05:57:59AM -0600, Justin Pryzby wrote:
> > < 2019-11-15 22:16:07.098 EST >PANIC: could not fsync file
> > "base/16491/1731839470.2": No such file or directory
> > < 2019-11-15 22:16:08.751 EST >LOG: checkpointer process (PID 27388) was
> > terminated by signal 6: Aborted
>
> An earlier segment of that relation had been opened successfully and was
> *still* opened:
>
> $ sudo grep 1731839470 /var/spool/abrt/ccpp-2019-11-15-22:16:08-27388/open_fds
> 63:/var/lib/pgsql/12/data/base/16491/1731839470
>
> For context:
>
> $ sudo grep / /var/spool/abrt/ccpp-2019-11-15-22:16:08-27388/open_fds |tail -3
> 61:/var/lib/pgsql/12/data/base/16491/1757077748
> 62:/var/lib/pgsql/12/data/base/16491/1756223121.2
> 63:/var/lib/pgsql/12/data/base/16491/1731839470
>
> So this may be an issue only with relations>segment (but, that interpretation
> could also be very naive).
FTR I have been trying to reproduce this but failing so far. I'm planning to dig some more in the next couple of days.

Yeah, it's a .2 file, which means that it's one that would normally be unlinked after you commit your transaction (unlike a no-suffix file, which would normally be dropped at the next checkpoint after the commit, as our strategy to prevent the relfilenode from being reused before the next checkpoint cycle), but should normally have had a SYNC_FORGET_REQUEST enqueued for it first. So the question is, how did it come to pass that a .2 file was ENOENT but there was no forget request? Difficult, given the definition of mdunlinkfork(). I wondered if something was going wrong in queue compaction or something like that, but I don't see it. I need to dig into the exact flow with the ALTER case to see if there is something I'm missing there, and perhaps try reproducing it with a tiny segment size to exercise some more multisegment-related code paths.
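
To make the invariant concrete, here is a toy model in Python. It is not PostgreSQL code. ToyCheckpointer, register_sync(), forget() and demo() are names made up for illustration, and the real checkpointer PANICs where this sketch only collects the missing paths. It assumes nothing more than a set of pending paths that is fsync'd at checkpoint time, and it shows why the forget request has to be processed before the unlink.

```python
import os
import tempfile


class ToyCheckpointer:
    """Toy pending-fsync table; a stand-in for the idea, not PostgreSQL's sync code."""

    def __init__(self):
        self.pending = set()  # paths with an outstanding sync request

    def register_sync(self, path):
        # A backend dirtied this segment and wants it fsync'd at checkpoint.
        self.pending.add(path)

    def forget(self, path):
        # Rough analogue of a forget request: cancel the pending fsync.
        self.pending.discard(path)

    def checkpoint(self):
        """fsync every pending file; return the paths that were missing."""
        missing = []
        for path in sorted(self.pending):
            try:
                fd = os.open(path, os.O_RDWR)
            except FileNotFoundError:
                # The real checkpointer would PANIC here.
                missing.append(path)
                continue
            try:
                os.fsync(fd)
            finally:
                os.close(fd)
        self.pending.clear()
        return missing


def demo():
    with tempfile.TemporaryDirectory() as d:
        seg0 = os.path.join(d, "1731839470")
        seg2 = os.path.join(d, "1731839470.2")
        cp = ToyCheckpointer()

        def dirty(path):
            with open(path, "wb") as f:
                f.write(b"x")
            cp.register_sync(path)

        # Expected order: forget request first, then unlink.
        dirty(seg0)
        dirty(seg2)
        cp.forget(seg2)
        os.unlink(seg2)
        good = cp.checkpoint()

        # Hypothetical bug: the segment is unlinked with no forget request.
        dirty(seg2)
        os.unlink(seg2)
        bad = cp.checkpoint()

        return good, [os.path.basename(p) for p in bad]
```

In this sketch demo() reports no missing files for the first checkpoint and only the .2 segment for the second, which is the shape of the failure in the log above. It says nothing about how the real request could have been lost.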