On Wed, Feb 9, 2022 at 5:00 PM Andres Freund <and...@anarazel.de> wrote: > The problem starts with > > commit aa01051418f10afbdfa781b8dc109615ca785ff9 > Author: Robert Haas <rh...@postgresql.org> > Date: 2022-01-24 14:23:15 -0500 > > pg_upgrade: Preserve database OIDs.
Well, that's sad. > I think the most realistic way to address this is the mechanism is to use the > mechanism from > https://postgr.es/m/CA%2BhUKGLdemy2gBm80kz20GTe6hNVwoErE8KwcJk6-U56oStjtg%40mail.gmail.com I agree. While I feel sort of bad about missing this issue in review, I also feel like it's pretty surprising that there isn't something plugging this hole already. It feels unexpected that our FD management layer might hand you an FD that references the previous file that had the same name instead of the one that exists right now, and I don't think it's too much of a stretch of the imagination to suppose that this won't be the last problem we have if we don't fix it. I also agree that the mechanism proposed in that thread is the best way to fix the problem. The sinval mechanism won't work, since there's nothing to ensure that the invalidation messages are processed in a sufficient timely fashion, and there's no obvious way to get around that problem. But the ProcSignalBarrier mechanism is not intertwined with heavyweight locking in the way that sinval is, so it should be a good fit for this problem. The main question in my mind is who is going to actually make that happen. It was your idea (I think), Thomas coded it, and my commit made it a live problem. So who's going to get something committed here? > If we add such barriers to dropdb(), XLOG_DBASE_DROP redo, DropTableSpace(), > XLOG_TBLSPC_DROP redo, I think all the "new" ways to hit this are foreclosed. What about movedb()? > We perhaps could avoid all relfilenode recycling, issues on the fd level, by > adding a barrier whenever relfilenode allocation wraps around. But I'm not > sure how easy that is to detect. I definitely think it would be nice to cover this issue not only for database OIDs and tablespace OIDs but also for relfilenode OIDs. Otherwise we're right on the cusp of having the same bug in that case. pg_upgrade doesn't drop and recreate tables the way it does databases, but if somebody made it do that in the future, would they think about this? I bet not. Dilip's been working on your idea of making relfilenode into a 56-bit quantity. That would make the problem of detecting wraparound go away. In the meantime, I suppose we could do think of trying to do something here: /* wraparound, or first post-initdb assignment, in normal mode */ ShmemVariableCache->nextOid = FirstNormalObjectId; ShmemVariableCache->oidCount = 0; That's a bit sketch, because this is the OID counter, which isn't only used for relfilenodes. I'm not sure whether there would be problems waiting for a barrier here - I think we might be holding some lwlocks in some code paths. > I think we might actually be able to remove the checkpoint from dropdb() by > using barriers? That looks like it might work. We'd need to be sure that the checkpointer would both close any open FDs and also absorb fsync requests before acknowledging the barrier. > I think we need to add some debugging code to detect "fd to deleted file of > same name" style problems. Especially during regression tests they're quite > hard to notice, because most of the contents read/written are identical across > recycling. Maybe. If we just plug the holes here, I am not sure we need permanent instrumentation. But I'm also not violently opposed to it. -- Robert Haas EDB: http://www.enterprisedb.com