From 6a2e2e0cab925b6310230274c779c3bbd53707a0 Mon Sep 17 00:00:00 2001
From: "dgrowley@gmail.com" <dgrowley@gmail.com>
Date: Fri, 23 Nov 2018 12:06:20 +1300
Subject: [PATCH v1] Delay locking of partitions during INSERT and UPDATE

During INSERT, even if we were inserting a single row into a partitioned
table, we would obtain a lock on every partition which was a direct or
an indirect partition of the insert target table.  This was done in order
to provide a consistent order to the locking of the partitions, which
happens to be the same order that partitions are locked during planning.
The problem with locking all these partitions was that if a partitioned
table had many partitions and the INSERT inserted only one or a few rows,
the overhead of the locking was significantly more than that of inserting
the actual rows.

This commit changes the locking to only lock partitions the first time we
route a tuple to them, so if you insert one row, then only one leaf
partition will be locked, plus any sub-partitioned tables that we search
through before we find the correct home of the tuple.  This does mean that
the locking order of partitions during INSERT becomes less well defined.
Previously, operations such as CREATE INDEX and TRUNCATE, when
performed on leaf partitions, could defend against deadlocking with
concurrent INSERT by performing the operation in table OID order.  However,
to deadlock, such DDL would have had to have been performed inside a
transaction and not in table OID order.  With this commit it's now possible
to get deadlocks even if the DDL is performed in table OID order.  If
required, such transactions can defend against such deadlocks by performing
a LOCK TABLE on the partitioned table before performing the DDL.
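
As a rough illustration of the difference in lock counts, here is a toy
model in C.  It is only a sketch: ToyRouting, toy_lock_all() and
toy_route() are hypothetical names that do not exist in PostgreSQL, and
the modulo lookup merely stands in for the real partition search.  In the
actual code the lock is taken by heap_open() with RowExclusiveLock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NPARTS 8			/* hypothetical number of leaf partitions */

/* Toy stand-in for tuple routing state; not a PostgreSQL struct. */
typedef struct ToyRouting
{
	bool		locked[TOY_NPARTS]; /* has this partition been locked? */
	int			nlocks;			/* lock acquisitions performed so far */
} ToyRouting;

void
toy_routing_init(ToyRouting *r)
{
	for (size_t i = 0; i < TOY_NPARTS; i++)
		r->locked[i] = false;
	r->nlocks = 0;
}

/* Old behaviour: lock every partition before routing any tuple. */
void
toy_lock_all(ToyRouting *r)
{
	for (size_t i = 0; i < TOY_NPARTS; i++)
	{
		if (!r->locked[i])
		{
			r->locked[i] = true;
			r->nlocks++;
		}
	}
}

/*
 * New behaviour: lock a partition only the first time a tuple is routed
 * to it.  Returns the index of the chosen partition.
 */
int
toy_route(ToyRouting *r, int key)
{
	int			partidx;

	assert(key >= 0);
	partidx = key % TOY_NPARTS; /* stand-in for the partition lookup */

	if (!r->locked[partidx])
	{
		r->locked[partidx] = true;
		r->nlocks++;
	}
	return partidx;
}
```

In this model, routing a single key leaves nlocks at 1, whereas
toy_lock_all() always performs TOY_NPARTS acquisitions no matter how few
rows are inserted.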

Currently, only INSERTs are affected by this change, as UPDATEs to a
partitioned table still obtain locks on all partitions either during
planning or during AcquireExecutorLocks.  However, there are upcoming
patches which may change this too.
---
 src/backend/executor/execPartition.c | 14 ++------------
 1 file changed, 2 insertions(+), 12 deletions(-)

diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 179a501f30..5d2b25bdf1 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -190,9 +190,6 @@ static void find_matching_subplans_recurse(PartitionPruningData *prunedata,
  * tuple routing for partitioned tables, encapsulates it in
  * PartitionTupleRouting, and returns it.
  *
- * Note that all the relations in the partition tree are locked using the
- * RowExclusiveLock mode upon return from this function.
- *
  * Callers must use the returned PartitionTupleRouting during calls to
  * ExecFindPartition().  The actual ResultRelInfo for a partition is only
  * allocated when the partition is found for the first time.
@@ -207,9 +204,6 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
 	PartitionTupleRouting *proute;
 	ModifyTable *node = mtstate ? (ModifyTable *) mtstate->ps.plan : NULL;
 
-	/* Lock all the partitions. */
-	(void) find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock, NULL);
-
 	/*
 	 * Here we attempt to expend as little effort as possible in setting up
 	 * the PartitionTupleRouting.  Each partition's ResultRelInfo is built on
@@ -509,11 +503,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 
 	oldcxt = MemoryContextSwitchTo(proute->memcxt);
 
-	/*
-	 * We locked all the partitions in ExecSetupPartitionTupleRouting
-	 * including the leaf partitions.
-	 */
-	partrel = heap_open(dispatch->partdesc->oids[partidx], NoLock);
+	partrel = heap_open(dispatch->partdesc->oids[partidx], RowExclusiveLock);
 
 	leaf_part_rri = makeNode(ResultRelInfo);
 	InitResultRelInfo(leaf_part_rri,
@@ -982,7 +972,7 @@ ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute, Oid partoid,
 	oldcxt = MemoryContextSwitchTo(proute->memcxt);
 
 	if (partoid != RelationGetRelid(proute->partition_root))
-		rel = heap_open(partoid, NoLock);
+		rel = heap_open(partoid, RowExclusiveLock);
 	else
 		rel = proute->partition_root;
 	partdesc = RelationGetPartitionDesc(rel);
-- 
2.16.2.windows.1

