Hi, I would like to discuss a refactoring patch set that builds on top of the patches at [1] to address $subject. To get an idea of what eliminating these overheads looks like, consider the following benchmarking results.
Note 1: I've forced the use of a generic plan by setting plan_cache_mode to 'force_generic_plan'.

Note 2: The individual TPS figures as measured are quite noisy; I only mean to show the rough trend as the number of partitions increases.

pgbench -i -s 10 --partitions={0, 10, 100, 1000}
pgbench -T120 -f test.sql -M prepared

test.sql:
\set aid random(1, 1000000)
update pgbench_accounts set abalance = abalance + 1 where aid = :aid;

Without any of the patches (the first column is the partition count):

0       tps = 13045.485121 (excluding connections establishing)
10      tps = 9358.157433 (excluding connections establishing)
100     tps = 1878.274500 (excluding connections establishing)
1000    tps = 84.684695 (excluding connections establishing)

The slowdown as the partition count increases can be explained by the fact that UPDATE and DELETE can't currently use runtime partition pruning. So, even if a given transaction updates only a single tuple in a single partition, the plans for *all* partitions are initialized, and so are their ResultRelInfos. That is, a lot of useless work is done in InitPlan() and ExecInitModifyTable().

With the patches at [1] (latest 0001+0002 posted there), whereby the generic plan for UPDATE can now perform runtime pruning, the numbers improve somewhat:

0       tps = 12743.487196 (excluding connections establishing)
10      tps = 12644.240748 (excluding connections establishing)
100     tps = 4158.123345 (excluding connections establishing)
1000    tps = 391.248067 (excluding connections establishing)

So even though the runtime pruning enabled by those patches ensures that the useless plans are left untouched by the executor, the ResultRelInfos are still built on the assumption that *all* result relations will be processed. With the attached patches (0001+0002+0003) that I want to discuss here in this thread, the numbers improve further:

0       tps = 13419.283168 (excluding connections establishing)
10      tps = 12588.016095 (excluding connections establishing)
100     tps = 8560.824225 (excluding connections establishing)
1000    tps = 1926.553901 (excluding connections establishing)

0001 and 0002 are preparatory patches. 0003 teaches nodeModifyTable.c to make the ResultRelInfo for a given result relation lazily, that is, only when the plan producing the tuples to be updated/deleted actually produces one belonging to that relation. So, if a transaction updates only one tuple, only one ResultRelInfo is made. At larger partition counts, that saves a significant amount of work; a rough sketch of the idea is below.
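To make the lazy-initialization idea concrete, here is a minimal sketch of the kind of helper 0003 implements. This is not the patch's actual code: the helper name ExecGetResultRelInfo and the mt_resultrel_inited tracking array are my inventions for illustration.

    /*
     * Hypothetical helper: return the ResultRelInfo for the result
     * relation at position 'whichrel' in the ModifyTable node's
     * resultRelations list, creating it on first access.
     */
    static ResultRelInfo *
    ExecGetResultRelInfo(ModifyTableState *mtstate, int whichrel)
    {
        EState     *estate = mtstate->ps.state;
        ResultRelInfo *rri = mtstate->resultRelInfo + whichrel;

        /* Assumes a parallel array tracking which entries exist. */
        if (!mtstate->mt_resultrel_inited[whichrel])
        {
            ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
            Index       rti = list_nth_int(node->resultRelations, whichrel);

            /* Opens the relation; it is already locked by this point. */
            Relation    rel = ExecGetRangeTableRelation(estate, rti);

            InitResultRelInfo(rri, rel, rti, NULL, estate->es_instrument);
            mtstate->mt_resultrel_inited[whichrel] = true;
        }

        return rri;
    }

With something like this in place, ExecModifyTable() only pays the initialization cost for relations that the subplan actually produces a tuple for, instead of ExecInitModifyTable() initializing every entry upfront.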
However, there's one new loop in ExecInitModifyTable(), added by the patches at [1], that still iterates over all partitions and that I haven't been able to eliminate so far; I'm seeing it cause a significant bottleneck at higher partition counts. The loop builds a hash table that maps result relation OIDs to their offsets in the PlannedStmt.resultRelations list. We need this mapping because the ResultRelInfos are accessed from the query-global array using that offset.
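For reference, the mapping is built roughly like the following; this is a simplified stand-in for the actual loop, with the struct and function names being mine:

    typedef struct ResultRelOidsEntry
    {
        Oid     rel_oid;        /* hash key: result relation's table OID */
        int     rri_index;      /* offset in PlannedStmt.resultRelations */
    } ResultRelOidsEntry;

    /*
     * Map each result relation's OID to its ordinal position in
     * 'resultRelations' (a list of RT indexes, as in the ModifyTable node).
     */
    static HTAB *
    BuildResultRelOidMap(List *resultRelations, EState *estate)
    {
        HASHCTL     hash_ctl;
        HTAB       *htab;
        ListCell   *lc;
        int         i = 0;

        memset(&hash_ctl, 0, sizeof(hash_ctl));
        hash_ctl.keysize = sizeof(Oid);
        hash_ctl.entrysize = sizeof(ResultRelOidsEntry);
        hash_ctl.hcxt = CurrentMemoryContext;
        htab = hash_create("ResultRelOids", list_length(resultRelations),
                           &hash_ctl, HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);

        /* This loop visits every result relation, pruned or not. */
        foreach(lc, resultRelations)
        {
            Index       rti = lfirst_int(lc);
            Oid         relid = exec_rt_fetch(rti, estate)->relid;
            ResultRelOidsEntry *entry;

            entry = (ResultRelOidsEntry *)
                hash_search(htab, &relid, HASH_ENTER, NULL);
            entry->rri_index = i++;
        }

        return htab;
    }

The build cost is O(number of result relations), which is why it starts to dominate once the per-tuple work has been made cheap.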
One approach that David Rowley mentioned at [1] for not having to do this mapping is to make the result relation's scan node's targetlist emit the relation's RT index or ordinal position to begin with, instead of the table OID, but I haven't figured out a way to do that.

Having taken care of the ModifyTable overheads (except the one mentioned in the previous paragraph), a few more bottlenecks pop up at higher partition counts. Basically, they result from performing some pre-execution actions on the relations contained in the plan by traversing the flat range table in whole:

1. AcquireExecutorLocks(): locks *all* partitions before executing the plan tree, even though runtime pruning will allow all but one to be skipped

2. ExecCheckRTPerms(): checks permissions of *all* partitions before executing the plan tree, but maybe it's okay to check only the ones that will actually be accessed

Problem 1 has been discussed before, and David Rowley even developed a patch that was discussed at [2]. The approach taken in that patch was to delay locking the partitions contained in a generic plan that are potentially prunable at runtime. As also described in the linked thread, though, that approach has a race condition: a concurrent session may invalidate the generic plan by altering a partition in the window between when AcquireExecutorLocks() runs on the plan and when the plan is executed.

Another solution, suggested to me by Robert Haas in an off-list discussion, is to teach AcquireExecutorLocks() or the nearby code to perform EXTERN-parameter-based pruning before passing the plan tree to the executor, and to lock only the partitions that survive that pruning. It's perhaps doable if we refactor ExecFindInitialMatchingSubPlans() to not require a full-blown execution context; a hypothetical sketch of that flow is below.
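To illustrate the suggested direction only: here is what a pruning-aware AcquireExecutorLocks() might look like. Note that the real function in plancache.c is static and takes a List of statements, and ExecFindLockableRelations() does not exist today; it stands in for the refactored initial-pruning entry point:

    /*
     * Hypothetical rewrite: perform initial (EXTERN parameter based)
     * pruning first and lock only the relations that survive, rather
     * than locking everything in the flat range table.
     */
    static void
    AcquireExecutorLocks(PlannedStmt *plannedstmt, ParamListInfo params,
                         bool acquire)
    {
        Bitmapset  *lockrels;
        int         rti = -1;

        /*
         * Assumed new entry point: evaluates the plan's
         * PartitionPruneInfos against 'params' without requiring a
         * full-blown execution context, returning the RT indexes of
         * surviving relations plus all non-prunable ones.
         */
        lockrels = ExecFindLockableRelations(plannedstmt, params);

        while ((rti = bms_next_member(lockrels, rti)) >= 0)
        {
            RangeTblEntry *rte = rt_fetch(rti, plannedstmt->rtable);

            if (rte->rtekind != RTE_RELATION)
                continue;

            if (acquire)
                LockRelationOid(rte->relid, rte->rellockmode);
            else
                UnlockRelationOid(rte->relid, rte->rellockmode);
        }
    }

The tricky part, of course, is evaluating the pruning expressions without the executor state that ExecFindInitialMatchingSubPlans() currently relies on.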
Or maybe we could do something more invasive and rewrite AcquireExecutorLocks() to walk the plan tree instead of the flat range table, looking for scan nodes and nodes that support runtime pruning, and lock the appropriate relations.

Regarding problem 2, I wonder if we shouldn't simply move the permission check to ExecGetRangeTableRelation(), so that it is performed the first time a given relation is accessed during execution; see the sketch after this paragraph.
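Again, only a sketch under simplifying assumptions: ExecCheckRTEPerms() would need to be made available outside execMain.c, and details such as EXPLAIN-only execution and parallel workers (which re-open relations) are glossed over:

    Relation
    ExecGetRangeTableRelation(EState *estate, Index rti)
    {
        Relation    rel = estate->es_relations[rti - 1];

        if (rel == NULL)
        {
            /* First access during this execution. */
            RangeTblEntry *rte = exec_rt_fetch(rti, estate);

            /*
             * Proposed addition: check this relation's permissions here,
             * rather than ExecCheckRTPerms() checking the entire range
             * table upfront.  The error report mirrors what
             * ExecCheckRTPerms() raises today.
             */
            if (!ExecCheckRTEPerms(rte))
                aclcheck_error(ACLCHECK_NO_PRIV,
                               get_relkind_objtype(get_rel_relkind(rte->relid)),
                               get_rel_name(rte->relid));

            /* Simplified; the real function treats parallel workers
             * differently when opening and locking the relation. */
            rel = table_open(rte->relid, rte->rellockmode);
            estate->es_relations[rti - 1] = rel;
        }

        return rel;
    }

With this, a relation whose subplan is pruned never gets its permissions checked at all, which is exactly the open question stated above: whether skipping those checks is acceptable.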
Anyway, applying David's aforementioned patch gives the following numbers:

0       tps = 12325.890487 (excluding connections establishing)
10      tps = 12146.420443 (excluding connections establishing)
100     tps = 12807.850709 (excluding connections establishing)
1000    tps = 7578.652893 (excluding connections establishing)

which suggests that it might be worthwhile to find a solution for this.

Finally, there's one more place that shows up in perf profiles at higher partition counts, and that is LockReleaseAll(). David, Tsunakawa-san had worked on a patch [3], which still applies and can be shown to be quite beneficial when generic plans are involved. I couldn't get it to show a major improvement over the above numbers in this case (for UPDATE, that is), but maybe that's because the loop in ExecInitModifyTable() mentioned above is still in the way.

-- 
Amit Langote
EnterpriseDB: http://www.enterprisedb.com

[1] https://commitfest.postgresql.org/28/2575/
[2] https://www.postgresql.org/message-id/flat/CAKJS1f_kfRQ3ZpjQyHC7%3DPK9vrhxiHBQFZ%2Bhc0JCwwnRKkF3hg%40mail.gmail.com
[3] https://www.postgresql.org/message-id/flat/CAKJS1f-7T9F1xLw5PqgOApcV6YX3WYC4XJHHCpxh8hzcZsA-xA%40mail.gmail.com#c57f2f592484bca76310f4c551d4de15
Attachments:

v1-0001-Revise-how-some-FDW-executor-APIs-obtain-ResultRe.patch
v1-0002-Do-not-set-rootResultRelIndex-unnecessarily.patch
v1-0003-Delay-initializing-UPDATE-DELETE-ResultRelInfos.patch