Thanks a lot for all the comments.

On 2022-11-29 3:13 p.m., Tom Lane wrote:
> ... not to mention creating a high probability of deadlocks between
> concurrent insertions to different partitions.  If they each
> ex-lock their own partition's index before starting to look into
> other partitions' indexes, it seems like a certainty that such
> cases would fail.  The rule of thumb about locking multiple objects
> is that all comers had better do it in the same order, and this
> isn't doing that.
In the current POC patch, the deadlock happens when backend-1 inserts a value into index X (partition-1), and backend-2 tries to insert a conflicting value right after backend-1 has released the buffer block lock but before backend-1 starts to check uniqueness on index Y (partition-2). In this case, backend-1 holds ExclusiveLock on transaction-1 and waits for ShareLock on transaction-2, while backend-2 holds ExclusiveLock on transaction-2 and waits for ShareLock on transaction-1. Based on my debugging tests, this only happens when backend-1 and backend-2 want to insert conflicting values. If this is true, is it OK to error out with either `deadlock` or `duplicated value`, since the values conflict anyway? (Hopefully end users can handle both in a similar way.) I think such a deadlock requires two conditions: 1) users insert conflicting values, and 2) the uniqueness check happens at exactly the right moment (see above).
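To make the lock cycle described above concrete, here is a minimal sketch of the wait-for relationship between the two backends. It is a toy model only, not PostgreSQL code: `ToyBackend`, `toy_holder_of` and `toy_has_deadlock` are hypothetical names invented for illustration, and the real deadlock detector works quite differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical toy model (not PostgreSQL code): each backend holds the
 * ExclusiveLock on its own transaction ID and may wait for a ShareLock
 * on at most one other transaction ID.
 */
#define TOY_NO_WAIT (-1)

typedef struct ToyBackend
{
    int xid;        /* transaction whose ExclusiveLock this backend holds */
    int waits_for;  /* xid it waits on via ShareLock, or TOY_NO_WAIT */
} ToyBackend;

/* Return the array index of the backend holding 'xid', or -1 if none. */
static int
toy_holder_of(const ToyBackend *b, size_t n, int xid)
{
    for (size_t i = 0; i < n; i++)
        if (b[i].xid == xid)
            return (int) i;
    return -1;
}

/*
 * Follow the wait edges starting at backend 'start'.  Returns true only
 * if the chain of waits leads back to 'start', i.e. 'start' is part of
 * a wait-for cycle.
 */
static bool
toy_has_deadlock(const ToyBackend *b, size_t n, size_t start)
{
    int cur = (int) start;

    assert(start < n);
    for (size_t steps = 0; steps < n; steps++)
    {
        if (b[cur].waits_for == TOY_NO_WAIT)
            return false;       /* chain ends: nobody is blocked forever */
        cur = toy_holder_of(b, n, b[cur].waits_for);
        if (cur < 0)
            return false;       /* waited-on transaction already finished */
        if ((size_t) cur == start)
            return true;        /* came back to where we started */
    }
    return false;
}
```

With backend-1 as `{1, 2}` and backend-2 as `{2, 1}`, the function reports a cycle. If either backend is not waiting, it does not.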
> That specific issue could perhaps be fixed by having everybody
> examine all the indexes in the same order, inserting when you
> come to your own partition's index and otherwise just checking
> for conflicts.  But that still means serializing insertions
> across all the partitions.  And the fact that you need to lock
> all the partitions, or even just know what they all are,
Here is the main change for the insertion cross-partition uniqueness check in `0004-support-global-unique-index-insert-and-update.patch`:

     result = _bt_doinsert(rel, itup, checkUnique, indexUnchanged, heapRel);

+    if (checkUnique != UNIQUE_CHECK_NO)
+        btinsert_check_unique_gi(itup, rel, heapRel, checkUnique);
+
     pfree(itup);

Here, a cross-partition uniqueness check is added after the index tuple has been inserted into the btree of the current partition. The idea is to make sure that other backends can find the index tuple that was just inserted (even though it is not yet marked as visible), and the uniqueness check on the current partition can be skipped because it has already been performed. Based on this change, I think insertions are serialized in two cases: 1) two insertions land on the same buffer block (waiting on the buffer lock); 2) two in-progress insertions have duplicate values (waiting on the transaction ID).
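The insert-then-check ordering can be sketched with a toy model. This is only an illustration under simplifying assumptions: the "indexes" are plain arrays, there is no locking or visibility logic, and `ToyIndex`, `toy_insert_own` and `toy_check_others` are hypothetical names that do not appear in the patch.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_ENTRIES 8
#define TOY_NO_CONFLICT 0       /* xids in this sketch start at 1 */

/* One index entry: the key plus the transaction that inserted it. */
typedef struct ToyEntry
{
    int key;
    int xid;
} ToyEntry;

/* A partition's index, reduced to an unsorted array of entries. */
typedef struct ToyIndex
{
    ToyEntry entries[TOY_MAX_ENTRIES];
    size_t   n;
} ToyIndex;

/* Step 1: insert into the backend's own partition index first. */
static void
toy_insert_own(ToyIndex *own, int key, int xid)
{
    assert(own->n < TOY_MAX_ENTRIES);
    own->entries[own->n].key = key;
    own->entries[own->n].xid = xid;
    own->n++;
}

/*
 * Step 2: scan every other partition's index for the same key.
 * Returns the xid of the first conflicting entry (the transaction the
 * caller would have to wait on), or TOY_NO_CONFLICT if there is none.
 */
static int
toy_check_others(const ToyIndex *parts, size_t nparts,
                 size_t own_part, int key)
{
    for (size_t p = 0; p < nparts; p++)
    {
        if (p == own_part)
            continue;           /* own partition was checked on insert */
        for (size_t i = 0; i < parts[p].n; i++)
            if (parts[p].entries[i].key == key)
                return parts[p].entries[i].xid;
    }
    return TOY_NO_CONFLICT;
}
```

In this model, if transaction 1 inserts key 42 into partition 0 and transaction 2 inserts key 42 into partition 1 before either runs step 2, each check returns the other's xid. That corresponds to case 2) above, and it is also the mutual wait from the deadlock scenario.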



