On Mon, Aug 20, 2018 at 4:21 PM, Alvaro Herrera <alvhe...@2ndquadrant.com> wrote:
> I wonder if this all stems from a misunderstanding of what I suggested
> to David offlist.  My suggestion was that the catalog scans would
> continue to use the catalog MVCC snapshot, and that the relcache
> entries would contain all the partitions that appear to the catalog;
> but each partition's entry would carry the Xid of the creating
> transaction in a field (say xpart), and that field is compared to the
> regular transaction snapshot: if xpart is visible to the transaction
> snapshot, then the partition is visible, otherwise not.  So you never
> try to access a partition that doesn't exist, because those just don't
> appear at all in the relcache entry.  But if you have an old
> transaction running with an old snapshot, and the partitioned table
> just acquired a new partition, then whether the partition will be
> returned as part of the partition descriptor or not depends on the
> visibility of its entry.
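
If I'm following, the test you have in mind amounts to something like
the sketch below.  None of this exists today -- the xpart field, the
helper function, and its name are all invented for illustration -- but
it shows the shape of a check made against the query's own snapshot
rather than the catalog snapshot:

#include "postgres.h"
#include "access/transam.h"     /* TransactionIdDidCommit() */
#include "utils/snapshot.h"     /* Snapshot */
#include "utils/tqual.h"        /* XidInMVCCSnapshot() */

/*
 * Hypothetical helper: decide whether a partition attached by
 * transaction "xpart" should appear in the partition descriptor built
 * for a query using "query_snapshot".
 */
static bool
partition_visible_to_query(TransactionId xpart, Snapshot query_snapshot)
{
    /*
     * If the attaching transaction still counts as in progress as far
     * as the query snapshot is concerned, the partition doesn't exist
     * for this query -- exactly as if its entry weren't in the
     * relcache at all.
     */
    if (XidInMVCCSnapshot(xpart, query_snapshot))
        return false;

    /* Otherwise the attach counts only if that transaction committed. */
    return TransactionIdDidCommit(xpart);
}

At least, that's the mechanism I'm assuming in what follows.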
Hmm.  One question is where you're going to get the XID of the creating
transaction.  If it's taken from the pg_class row or the pg_inherits
row or something of that sort, then you risk getting a bogus value if
something updates that row other than what you expect -- and the
consequences of that are pretty bad here: for this to work as you
intend, you need an exactly-correct value, not one that is newer or
older.  An alternative is to add an xid field that stores the value
explicitly, and that might work, but you'll have to arrange for that
value to be frozen at the appropriate time.  A further problem is that
there could be multiple changes in quick succession.  Suppose that a
partition is attached, then detached before the attach operation is
all-visible, and then reattached, perhaps with different partition
bounds.

> I think that works fine for ATTACH without any further changes.  I'm
> not so sure about DETACH, particularly when snapshots persist for a
> "long time" (a repeatable-read transaction).  ISTM that in the above
> design, the partition descriptor would lose the entry for the
> detached partition ahead of time, which means queries would silently
> fail to see their data (though they wouldn't crash).

I don't see why they wouldn't crash.  If the partition descriptor gets
rebuilt and some partitions disappear out from under you, the old
partition descriptor is going to get freed, and the executor has a
cached pointer to it, so it seems like you are in trouble.

> I first thought this could be fixed by waiting for those snapshots to
> finish, but then I realized that there's no actual place where
> waiting achieves anything.  Certainly it's not useful to wait before
> commit (because other snapshots are going to be starting all the
> time), and it's not useful to wait after the commit (because by then
> the catalog tuple is already gone).  Maybe we need two transactions:
> mark the partition as removed with an xmax of sorts, commit, wait for
> all snapshots, start a new transaction, remove the partition's
> catalog tuple, commit.

And what would that accomplish, exactly?  Waiting for all snapshots
would ensure that every still-running transaction can see the xmax
with which the partition has been marked as removed, but what good
does that do?  In order to have a plausible algorithm, you have to
describe both what the ATTACH/DETACH operation does and what the other
concurrent transactions do, and how those things interact.  Otherwise,
it's like saying that we're going to solve a problem where X and Y
overlap by having X take a lock: if Y never takes a conflicting lock,
that accomplishes nothing.

Generally, I think I see what you're aiming at: make ATTACH and DETACH
have MVCC-like semantics with respect to concurrent transactions.  I
don't think that's a dumb idea from a theoretical perspective, but in
practice I think it's going to be very difficult to implement.  We
have no other DDL that has such semantics, even though there's no
reason we couldn't; for example, TRUNCATE could take only
ShareUpdateExclusiveLock, with transactions that can't yet see the
TRUNCATE as committed continuing to operate on the old heap.  We could
do such things, but we don't, and if you decide to do them here,
you've probably got a lot of work ahead of you.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company