On Wed, Jul 12, 2023 at 4:26 AM Michael Paquier <mich...@paquier.xyz> wrote: > Taking the test case of upthread, from what I can see, the ALTER TABLE > .. REPLICA IDENTITY registers two relcache invalidations for pk_foo > (via RegisterRelcacheInvalidation), which is the relcache entry whose > stuff is messed up. I would have expected a refresh of the cache of > pk_foo to happen when doing the ALTER INDEX .. ATTACH PARTITION, but > for some reason it does not happen when running the whole in a > transaction block. I cannot put my finger on what's wrong for the > moment, but based on my current impressions the inval requests are > correctly registered when switching the replica identity, but nothing > about pk_foo is updated when attaching a partition to it in the last > step where the invisible update happens :/
I'm not sure exactly what is happening here, but it looks to me like ATExecReplicaIdentity() only takes ShareLock on the index and nevertheless feels entitled to update the pg_index tuple, which is pretty strange. We normally require AccessExclusiveLock to perform DDL on an object, and in the limited exceptions that we have to that rule - see AlterTableGetLockLevel - it's pretty much always a self-exclusive lock. Otherwise, two backends might try to do the same DDL operation at the same time, which would lead to low-level failures trying to update the same tuple such as the one seen here. But even if that doesn't happen or is prevented by some other mechanism, there's still a synchronization problem. Suppose backend B1 modifies some state via a DDL operation on table T and then afterward backend B2 wants to perform a non-DDL operation that depends on that state. Well, B1 takes some lock on the relation, and B2 takes a lock that would conflict with it, and that guarantees that B2 starts after B1 commits. That means that B2 is guaranteed to see the invalidations that were queued by B1, which means it will flush any state out of its cache that was made stale by the operation performed by B1. If the locks didn't conflict, B2 might start before B1 committed and either fail to update its caches or update them but with a version of the tuples that's about to be made obsolete when B1 commits. So ShareLock doesn't feel like a very safe choice here. But I'm not quite sure exactly what's going wrong, either. Every update is going to call CacheInvalidateHeapTuple(), and updating either an index's pg_class tuple or its pg_index tuple should queue up a relcache invalidation, and CommandEndInvalidationMessages() should cause that to be processed. If this were multiple transactions, the only thing that would be different is that the invalidation messages would be in the shared queue, so maybe there's something going on with the timing of CommandEndInvalidationMessages() vs. AcceptInvalidationMessages() that accounts for the problem occurring in one case but not the other. But I do wonder whether the underlying problem is that what ATExecReplicaIdentity() is doing is not really safe. -- Robert Haas EDB: http://www.enterprisedb.com