On Fri, Nov 29, 2019 at 11:14 AM Justin Pryzby <pry...@telsasoft.com> wrote:
> On Fri, Nov 29, 2019 at 10:50:36AM +1300, Thomas Munro wrote:
> > On Fri, Nov 29, 2019 at 3:13 AM Thomas Munro <thomas.mu...@gmail.com> wrote:
> > > On Wed, Nov 27, 2019 at 7:53 PM Justin Pryzby <pry...@telsasoft.com> wrote:
> > > > 2019-11-26 23:41:50.009-05 | could not fsync file "pg_tblspc/16401/PG_12_201909212/16460/973123799.10": No such file or directory
> > >
> > > I managed to reproduce this (see below).  I think I know what the
> >
> > Here is a patch that fixes the problem by sending all the SYNC_FORGET_REQUEST messages up front.
>
> I managed to reproduce it too, but my recipe is crummy enough that I'm not even going to send it..
>
> I confirmed that patch also seems to work for my worse recipe.
Thanks.  One thing I'm wondering about is what happens if you encounter EPERM, EIO, etc. while probing the existing files, so that we give up early and don't deal with some higher numbered files.

(1) We'll still have unlinked the lower numbered files, and we'll leave the higher numbered files where they are, which might confuse matters if the relfilenode number is later recycled, so that the zombie segments appear to belong to the new relfilenode.  That is longstanding PostgreSQL behaviour not changed by commit 3eb77eba or this patch IIUC (despite moving a few things around), and I guess it's unlikely to bite you considering all the coincidences required, but if there's a transient EPERM (think Windows virus checker opening files without the shared read flag), it's not inconceivable.  One solution would be not to queue the unlink request for segment 0 if anything goes wrong while unlinking the other segments (in other words: if you can't unlink all the segments, deliberately leak segment 0 and thus the *whole relfilenode*, not just the problem file(s)); see the first sketch below.

(2) Even with the fix I just proposed, if you give up early due to EPERM, EIO, etc., there might still be sync requests queued for high numbered segments, so you could reach the PANIC case.  It's likely that, once you've hit such an error, the checkpointer is going to be panicking anyway when it starts seeing similar errors, but still, it'd be possible to do better here (especially if the error was transient and short-lived).  If we don't accept that behaviour, we could switch (back) to a single cancel message that can whack every request relating to the relation (essentially as it was in 11, though it requires a bit more work for the new design; see the second sketch below), or stop using _mdfd_getseg() for this so that you can remove segments independently without worrying about sync requests for other segments (it was actually like that in an earlier version of the patch for commit 3eb77eba, but someone complained that it didn't benefit from fd caching).
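To make (1) concrete, here's roughly the shape I have in mind.  Sketch only: segment_path() and queue_segment_zero_unlink() are names I just invented for this email, not the real md.c interfaces.

#include "postgres.h"

#include <unistd.h>

#include "storage/relfilenode.h"

/* invented for this sketch; not the real md.c interfaces */
extern char *segment_path(RelFileNode rnode, int segno);
extern void queue_segment_zero_unlink(RelFileNode rnode);

/*
 * Try to unlink segments 1..N eagerly, and only queue the deferred
 * unlink of segment 0 (which is what eventually allows the relfilenode
 * number to be recycled) if every higher numbered segment is really
 * gone.  On any other failure, deliberately leak the whole relfilenode
 * rather than leaving zombie segments for a future relation to inherit.
 */
static void
unlink_relation_segments(RelFileNode rnode)
{
    bool        all_gone = true;
    int         segno;

    for (segno = 1;; segno++)
    {
        char       *path = segment_path(rnode, segno);

        if (unlink(path) < 0)
        {
            if (errno == ENOENT)
            {
                pfree(path);
                break;          /* ran off the end: no more segments */
            }

            /* EPERM, EIO, ...: warn, but keep probing higher segments */
            ereport(WARNING,
                    (errcode_for_file_access(),
                     errmsg("could not remove file \"%s\": %m", path)));
            all_gone = false;
        }
        pfree(path);
    }

    if (all_gone)
        queue_segment_zero_unlink(rnode);
    /* otherwise: leak segment 0, and with it the relfilenode number */
}

The point is just the ordering: nothing that lets the relfilenode number be reused happens until we know all the higher segments are gone.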
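As for the single cancel message in (2), I imagine something in the spirit of the existing filter-style requests, with a match callback that ignores the segment number.  Again just a sketch: relation_filetag_matches() is a made-up name, and I'm not claiming this is all the new design would need.

#include "postgres.h"

#include "storage/relfilenode.h"
#include "storage/sync.h"

/*
 * Consider every pending sync request for one relfilenode to be a hit,
 * regardless of fork or segment number.  With a callback like this
 * behind a filter-style cancel, dropping a relation wouldn't need one
 * SYNC_FORGET_REQUEST per segment, so an early exit from the unlink
 * loop couldn't strand requests for high numbered segments.
 */
static bool
relation_filetag_matches(const FileTag *ftag, const FileTag *candidate)
{
    /* deliberately ignore forknum and segno: whack the whole relation */
    return RelFileNodeEquals(ftag->rnode, candidate->rnode);
}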