On Mon, Aug 20, 2018 at 4:21 PM, Alvaro Herrera <alvhe...@2ndquadrant.com> wrote:
> I wonder if this all stems from a misunderstanding of what I suggested
> to David offlist.  My suggestion was that the catalog scans would
> continue to use the catalog MVCC snapshot, and that the relcache
> entries would contain all the partitions that appear to the catalog;
> but each partition's entry would carry the Xid of the creating
> transaction in a field (say xpart), and that field is compared to the
> regular transaction snapshot: if xpart is visible to the transaction
> snapshot, then the partition is visible, otherwise not.  So you never
> try to access a partition that doesn't exist, because those just don't
> appear at all in the relcache entry.  But if you have an old
> transaction running with an old snapshot, and the partitioned table
> just acquired a new partition, then whether the partition will be
> returned as part of the partition descriptor or not depends on the
> visibility of its entry.
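
If I'm following, the test you have in mind amounts to something like
the sketch below.  None of this exists today -- the xpart field, the
helper function, and its name are all invented for illustration -- but
it shows the shape of a check made against the query's own snapshot
rather than the catalog snapshot:

#include "postgres.h"
#include "access/transam.h"     /* TransactionIdDidCommit() */
#include "utils/snapshot.h"     /* Snapshot */
#include "utils/tqual.h"        /* XidInMVCCSnapshot() */

/*
 * Hypothetical helper: decide whether a partition attached by
 * transaction "xpart" should appear in the partition descriptor built
 * for a query using "query_snapshot".
 */
static bool
partition_visible_to_query(TransactionId xpart, Snapshot query_snapshot)
{
    /*
     * If the attaching transaction still counts as in progress as far
     * as the query snapshot is concerned, the partition doesn't exist
     * for this query -- exactly as if its entry weren't in the
     * relcache at all.
     */
    if (XidInMVCCSnapshot(xpart, query_snapshot))
        return false;

    /* Otherwise the attach counts only if that transaction committed. */
    return TransactionIdDidCommit(xpart);
}

At least, that's the mechanism I'm assuming in what follows.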
Hmm.  One question is where you're going to get the XID of the creating
transaction.  If it's taken from the pg_class row or the pg_inherits
row or something of that sort, then you risk getting a bogus value if
something updates that row other than what you expect -- and the
consequences of that are pretty bad here: for this to work as you
intend, you need an exactly-correct value, not one that is newer or
older.  An alternative is to add an xid field that stores the value
explicitly, and that might work, but you'll have to arrange for that
value to be frozen at the appropriate time.  A further problem is that
there could be multiple changes in quick succession.  Suppose that a
partition is attached, then detached before the attach operation is
all-visible, and then reattached, perhaps with different partition
bounds.

> I think that works fine for ATTACH without any further changes.  I'm
> not so sure about DETACH, particularly when snapshots persist for a
> "long time" (a repeatable-read transaction).  ISTM that in the above
> design, the partition descriptor would lose the entry for the
> detached partition ahead of time, which means queries would silently
> fail to see their data (though they wouldn't crash).

I don't see why they wouldn't crash.  If the partition descriptor gets
rebuilt and some partitions disappear out from under you, the old
partition descriptor is going to get freed, and the executor has a
cached pointer to it, so it seems like you are in trouble.

> I first thought this could be fixed by waiting for those snapshots to
> finish, but then I realized that there's no actual place where
> waiting achieves anything.  Certainly it's not useful to wait before
> commit (because other snapshots are going to be starting all the
> time), and it's not useful to wait after the commit (because by then
> the catalog tuple is already gone).  Maybe we need two transactions:
> mark the partition as removed with an xmax of sorts, commit, wait for
> all snapshots, start a new transaction, remove the partition's
> catalog tuple, commit.

And what would that accomplish, exactly?  Waiting for all snapshots
would ensure that every still-running transaction can see the xmax
with which the partition has been marked as removed, but what good
does that do?  In order to have a plausible algorithm, you have to
describe both what the ATTACH/DETACH operation does and what the other
concurrent transactions do, and how those things interact.  Otherwise,
it's like saying that we're going to solve a problem where X and Y
overlap by having X take a lock: if Y never takes a conflicting lock,
that accomplishes nothing.

Generally, I think I see what you're aiming at: make ATTACH and DETACH
have MVCC-like semantics with respect to concurrent transactions.  I
don't think that's a dumb idea from a theoretical perspective, but in
practice I think it's going to be very difficult to implement.  We
have no other DDL that has such semantics, even though there's no
reason we couldn't; for example, TRUNCATE could take only
ShareUpdateExclusiveLock, with transactions that can't yet see the
TRUNCATE as committed continuing to operate on the old heap.  We could
do such things, but we don't, and if you decide to do them here,
you've probably got a lot of work ahead of you.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company