On Tue, Oct 18, 2022 at 3:59 PM Tom Lane <t...@sss.pgh.pa.us> wrote:
> Isn't it already the case (or could be made so) that relation file
> removal happens only in the checkpointer?  I wonder if we could
> get to a situation where we can interlock file removal just by
> commanding the checkpointer to not do it for awhile.  Then combining
> that with caching readdir results (to narrow the window in which we
> have to stop the checkpointer) might yield a solution that has some
> credibility.  This scheme doesn't attempt to prevent file creation
> concurrently with a readdir, but you'd have to make some really
> adverse assumptions to believe that file creation would cause a
> pre-existing entry to get missed (as opposed to getting scanned
> twice).  So it might be an acceptable answer.
I believe that individual backends directly remove all relation forks
other than the main fork, and all segments other than the first one.
The discussion on various other threads has been in the direction of
trying to move that last case out of the checkpointer as well - i.e.
getting rid of what Thomas dubbed "tombstone" files - which is pretty
much the exact opposite of this proposal.

But even apart from that, I don't think this would be that easy to
implement. If you removed a large relation, you'd have to tell the
checkpointer to remove many files instead of just one. That sounds
kinda painful: it would mean more IPC, and it would delay file removal
just so that we can tell the checkpointer to delay it some more.

And I don't think we really need to do any of that. We could invent a
new kind of lock tag for the <dboid/tsoid> combination. Take a share
lock to create or remove files; take an exclusive lock to scan the
directory. I think that accomplishes the same thing as your proposal,
but more directly, and with less overhead. It's still substantially
more than NO overhead, though.

-- 
Robert Haas
EDB: http://www.enterprisedb.com