On 3 August 2018 at 01:25, David Rowley <david.row...@2ndquadrant.com> wrote: > 1. Do all the normal partition attach partition validation. > 2. Insert a record into pg_partition with partisvalid=false > 3. Obtain a session-level ShareUpdateExclusiveLock on the partitioned table. > 4. Obtain a session-level AccessExclusiveLock on the partition being attached. > 5. Commit. > 6. Start a new transaction. > 7. Wait for snapshots older than our own to be released. > 8. Mark the partition as valid > 9. Invalidate relcache for the partitioned table. > 10. release session-level locks.
So I was thinking about this again and realised this logic is broken. All it takes is a snapshot that starts after the ATTACH PARTITION started and before it completed. This snapshot will have the new partition attached while it's possibly still open which could lead to non-repeatable reads in a repeatable read transaction. The window for this to occur is possibly quite large given that the ATTACH CONCURRENTLY can wait a long time for older snapshots to finish. Here's my updated thinking for an implementation which seems to get around the above problem: ATTACH PARTITION CONCURRENTLY: 1. Obtain a ShareUpdateExclusiveLock on the partitioned table rather than an AccessExclusiveLock. 2. Do all the normal partition attach partition validation. 3. Insert pg_partition record with partvalid = true. 4. Invalidate relcache entry for the partitioned table 5. Any loops over a partitioned table's PartitionDesc must check PartitionIsValid(). This will return true if the current snapshot should see the partition or not. The partition is valid if partisvalid = true and the xmin precedes or is equal to the current snapshot. #define PartitionIsValid(pd, i) (((pd)->is_valid[(i)] \ && TransactionIdPrecedesOrEquals((pd)->xmin[(i)], GetCurrentTransactionId())) \ || (!(pd)->is_valid[(i)] \ && TransactionIdPrecedesOrEquals(GetCurrentTransactionId(), (pd)->xmin[(i)]))) DETACH PARTITION CONCURRENTLY: 1. Obtain ShareUpdateExclusiveLock on partition being detached (instead of the AccessShareLock that non-concurrent detach uses) 2. Update the pg_partition record, set partvalid = false. 3. Commit 4. New transaction. 5. Wait for transactions which hold a snapshot older than the one held when updating pg_partition to complete. 6. Delete the pg_partition record. 7. Perform other cleanup, relpartitionparent = 0, pg_depend etc. DETACH PARTITION CONCURRENTLY failure when it fails after step 3 (above) 1. Make vacuum of a partition check for pg_partition.partvalid = false, if xmin of tuple is old enough, perform a partition cleanup by doing steps 6+7 above. A VACUUM FREEZE must run before transaction wraparound, so this means a partition can never reattach itself when the transaction counter wraps. I believe I've got the attach and detach working correctly now and also isolation tests that appear to prove it works. I've also written the failed detach cleanup code into vacuum. Unusually, since foreign tables can also be partitions this required teaching auto-vacuum to look at foreign tables, only in the sense of checking for failed detached partitions. It also requires adding vacuum support for foreign tables too. It feels a little bit weird to modify auto-vacuum to look at foreign tables, but I really couldn't see another way to do this. I'm now considering if this all holds together in the event the pg_partition tuple of an invalid partition becomes frozen. The problem would be that PartitionIsValid() could return the wrong value due to TransactionIdPrecedesOrEquals(GetCurrentTransactionId(), (pd)->xmin[(i)]). this code is trying to keep the detached partition visible to older snapshots, but if pd->xmin[i] becomes frozen, then the partition would become invisible. However, I think this won't be a problem since a VACUUM FREEZE would only freeze tuples that are also old enough to have failed detaches cleaned up earlier in the vacuum process. Also, we must disallow a DEFAULT partition from being attached to a partition with a failed DETACH CONCURRENTLY as it wouldn't be very clear what the default partition's partition qual would be, as this is built based on the quals of all attached partitions. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services