Thanks a lot for all the comments.
On 2022-11-29 3:13 p.m., Tom Lane wrote:
... not to mention creating a high probability of deadlocks between
concurrent insertions to different partitions. If they each
ex-lock their own partition's index before starting to look into
other partitions' indexes, it seems like a certainty that such
cases would fail. The rule of thumb about locking multiple objects
is that all comers had better do it in the same order, and this
isn't doing that.
In the current POC patch, the deadlock is happening when backend-1
inserts a value to index X(partition-1), and backend-2 try to insert a
conflict value right after backend-1 released the buffer block lock but
before start to check unique on index Y(partition-2). In this case,
backend-1 holds ExclusiveLock on transaction-1 and waits for ShareLock
on transaction-2 , while backend-2 holds ExclusiveLock on transaction-2
and waits for ShareLock on transaction-1. Based on my debugging tests,
this only happens when backend-1 and backend-2 want to insert a conflict
value. If this is true, then is it ok to either `deadlock` error out or
`duplicated value` error out since this is a conflict value? (hopefully
end users can handle it in a similar way). I think the probability of
such deadlock has two conditions: 1) users insert a conflict value and
plus 2) the uniqueness checking happens in the right moment (see above).
That specific issue could perhaps be fixed by having everybody
examine all the indexes in the same order, inserting when you
come to your own partition's index and otherwise just checking
for conflicts. But that still means serializing insertions
across all the partitions. And the fact that you need to lock
all the partitions, or even just know what they all are,
Here is the main change for insertion cross-partition uniqueness check
in `0004-support-global-unique-index-insert-and-update.patch`,
result = _bt_doinsert(rel, itup, checkUnique, indexUnchanged,
heapRel);
+ if (checkUnique != UNIQUE_CHECK_NO)
+ btinsert_check_unique_gi(itup, rel, heapRel, checkUnique);
+
pfree(itup);
where, a cross-partition uniqueness check is added after the index tuple
btree insertion on current partition. The idea is to make sure other
backends can find out the ongoing index tuple just inserted (but before
marked as visible yet), and the current partition uniqueness check can
be skipped as it has already been checked. Based on this change, I think
the insertion serialization can happen in two cases: 1) two insertions
happen on the same buffer block (buffer lock waiting); 2) two ongoing
insertions with duplicated values (transaction id waiting);