On Sat, Nov 7, 2020 at 12:40 AM Amit Kapila <amit.kapil...@gmail.com> wrote:
> I think one of the problems is returning fewer rows and that too
> without any warning or error, so maybe that is a bigger problem but we
> seem to be okay with it as that is already a known thing though I
> think that is not documented anywhere.
I'm not OK with it, and I'm not sure it's widely known or understood,
though I think we've made some progress in this thread.  Perhaps, as a
separate project, we need to solve several related problems with a
shmem table of relation sizes covering not-yet-synced files, so that
smgrnblocks() is fast and always sees all preceding smgrextend()
calls.  If we're going to need something like that anyway, and if we
can come up with a simple way to detect and report this type of
failure in the meantime, maybe this fast DROP project should just go
ahead and use the existing smgrnblocks() function without the weird
caching bandaid that only works in recovery?

> > The main argument I can think of against the idea of using plain old
> > smgrnblocks() is that the current error messages on smgrwrite()
> > failure for stray blocks would be indistinguishable from cases where
> > an external actor unlinked the file.  I don't mind getting an error
> > that prevents checkpointing -- your system is in big trouble! -- but
> > it'd be nice to be able to detect that *we* unlinked the file,
> > implying the filesystem and bufferpool are out of sync, and spit out
> > a special diagnostic message.  I suppose if it's the checkpointer
> > doing the writing, it could check if the relfilenode is on the
> > queued-up-for-delete-after-the-checkpoint list, and if so, it could
> > produce a different error message just for this edge case.
> > Unfortunately that's not a general solution, because any backend
> > might try to write a buffer out and they aren't synchronised with
> > checkpoints.
>
> Yeah, but I am not sure if we can consider manual (external actor)
> tinkering with the files the same as something that happened due to
> the database server relying on the wrong information.

Here's a rough idea I thought of to detect this case; I'm not sure if
it has holes.
When unlinking a relation, currently we truncate segment 0 and unlink
all the rest of the segments, and tell the checkpointer to unlink
segment 0 after the next checkpoint.  What if we also renamed segment
0 to "$X.dropped" (to be unlinked by the checkpointer), and taught
GetNewRelFileNode() to also skip anything for which "$X.dropped"
exists?  Then mdwrite() could use _mdfd_getseg(EXTENSION_RETURN_NULL),
and if it gets NULL (= no file), it could check whether "$X.dropped"
exists; if so, it knows that it is trying to write a stray block from
a dropped relation that is still in the buffer pool.  Then we panic,
or warn but drop the write.  The point of the renaming is that (1)
mdwrite() for segment 0 will detect the missing file (not just higher
segments), and (2) every backend can see that a relation has been
recently dropped, while also interlocking with the checkpointer
through buffer locks.

> One vague idea could be to develop pg_test_seek which can detect such
> problems but not sure if we can rely on such a tool to always give us
> the right answer.  Were you able to consistently reproduce the lseek
> problem on the system where you have tried?

Yeah, I can reproduce that reliably, but it requires quite a bit of
set-up as root, so it might be tricky to package up in an easy-to-run
form.  It might be quite nice to prepare an easy-to-use "gallery of
weird buffered I/O effects" project, including some of the
local-filesystem-with-fault-injection stuff that Craig Ringer and
others were testing with a couple of years ago, but maybe not in the
pg repo.