Hi, I would like to discuss a refactoring patch set that builds on top of the patches at [1] to address $subject. To get an idea of what eliminating these overheads looks like, consider the following benchmarking results.
Note 1: I've forced the use of a generic plan by setting plan_cache_mode to 'force_generic_plan'.

Note 2: The individual TPS figures as measured are quite noisy; I only mean to show the rough trend as the number of partitions increases.

pgbench -i -s 10 --partitions={0, 10, 100, 1000}
pgbench -T120 -f test.sql -M prepared

test.sql:
\set aid random(1, 1000000)
update pgbench_accounts set abalance = abalance + 1 where aid = :aid;

Without any of the patches (the first column is the partition count):

0       tps = 13045.485121 (excluding connections establishing)
10      tps = 9358.157433 (excluding connections establishing)
100     tps = 1878.274500 (excluding connections establishing)
1000    tps = 84.684695 (excluding connections establishing)

The slowdown as the partition count increases can be explained by the fact that UPDATE and DELETE can't currently use runtime partition pruning. So, even if a given transaction updates only a single tuple in a single partition, the plans for *all* partitions are initialized, and so are their ResultRelInfos. That is, a lot of useless work is done in InitPlan() and ExecInitModifyTable().

With the patches at [1] (latest 0001+0002 posted there), whereby the generic plan for UPDATE can now perform runtime pruning, the numbers improve somewhat:

0       tps = 12743.487196 (excluding connections establishing)
10      tps = 12644.240748 (excluding connections establishing)
100     tps = 4158.123345 (excluding connections establishing)
1000    tps = 391.248067 (excluding connections establishing)

So even though the runtime pruning enabled by those patches ensures that the useless plans are left untouched by the executor, the ResultRelInfos are still built on the assumption that *all* result relations will be processed. With the attached patches (0001+0002+0003) that I want to discuss here in this thread, the numbers improve further:

0       tps = 13419.283168 (excluding connections establishing)
10      tps = 12588.016095 (excluding connections establishing)
100     tps = 8560.824225 (excluding connections establishing)
1000    tps = 1926.553901 (excluding connections establishing)

0001 and 0002 are preparatory patches. 0003 teaches nodeModifyTable.c to make the ResultRelInfo for a given result relation lazily, that is, only when the plan producing the tuples to be updated/deleted actually produces one belonging to that relation. So, if a transaction updates only one tuple, only one ResultRelInfo is made. At larger partition counts, that saves a significant amount of work; a rough sketch of the idea is below.
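To make the lazy-initialization idea concrete, here is a minimal sketch of the kind of helper 0003 implements. This is not the patch's actual code: the helper name ExecGetResultRelInfo and the mt_resultrel_inited tracking array are my inventions for illustration.

    /*
     * Hypothetical helper: return the ResultRelInfo for the result
     * relation at position 'whichrel' in the ModifyTable node's
     * resultRelations list, creating it on first access.
     */
    static ResultRelInfo *
    ExecGetResultRelInfo(ModifyTableState *mtstate, int whichrel)
    {
        EState     *estate = mtstate->ps.state;
        ResultRelInfo *rri = mtstate->resultRelInfo + whichrel;

        /* Assumes a parallel array tracking which entries exist. */
        if (!mtstate->mt_resultrel_inited[whichrel])
        {
            ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
            Index       rti = list_nth_int(node->resultRelations, whichrel);

            /* Opens the relation; it is already locked by this point. */
            Relation    rel = ExecGetRangeTableRelation(estate, rti);

            InitResultRelInfo(rri, rel, rti, NULL, estate->es_instrument);
            mtstate->mt_resultrel_inited[whichrel] = true;
        }

        return rri;
    }

With something like this in place, ExecModifyTable() only pays the initialization cost for relations that the subplan actually produces a tuple for, instead of ExecInitModifyTable() initializing every entry upfront.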
However, there's one new loop in ExecInitModifyTable(), added by the patches at [1], that still iterates over all partitions and that I haven't been able to eliminate so far; I'm seeing it cause a significant bottleneck at higher partition counts. The loop builds a hash table that maps result relation OIDs to their offsets in the PlannedStmt.resultRelations list. We need this mapping because the ResultRelInfos are accessed from the query-global array using that offset.
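For reference, the mapping is built roughly like the following; this is a simplified stand-in for the actual loop, with the struct and function names being mine:

    typedef struct ResultRelOidsEntry
    {
        Oid     rel_oid;        /* hash key: result relation's table OID */
        int     rri_index;      /* offset in PlannedStmt.resultRelations */
    } ResultRelOidsEntry;

    /*
     * Map each result relation's OID to its ordinal position in
     * 'resultRelations' (a list of RT indexes, as in the ModifyTable node).
     */
    static HTAB *
    BuildResultRelOidMap(List *resultRelations, EState *estate)
    {
        HASHCTL     hash_ctl;
        HTAB       *htab;
        ListCell   *lc;
        int         i = 0;

        memset(&hash_ctl, 0, sizeof(hash_ctl));
        hash_ctl.keysize = sizeof(Oid);
        hash_ctl.entrysize = sizeof(ResultRelOidsEntry);
        hash_ctl.hcxt = CurrentMemoryContext;
        htab = hash_create("ResultRelOids", list_length(resultRelations),
                           &hash_ctl, HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);

        /* This loop visits every result relation, pruned or not. */
        foreach(lc, resultRelations)
        {
            Index       rti = lfirst_int(lc);
            Oid         relid = exec_rt_fetch(rti, estate)->relid;
            ResultRelOidsEntry *entry;

            entry = (ResultRelOidsEntry *)
                hash_search(htab, &relid, HASH_ENTER, NULL);
            entry->rri_index = i++;
        }

        return htab;
    }

The build cost is O(number of result relations), which is why it starts to dominate once the per-tuple work has been made cheap.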
One approach that David Rowley mentioned at [1] for not having to do this mapping is to make the result relation's scan node's targetlist emit the relation's RT index or ordinal position to begin with, instead of the table OID, but I haven't figured out a way to do that.

Having taken care of the ModifyTable overheads (except the one mentioned in the previous paragraph), a few more bottlenecks pop up at higher partition counts. Basically, they result from performing some pre-execution actions on the relations contained in the plan by traversing the flat range table in whole:

1. AcquireExecutorLocks(): locks *all* partitions before executing the plan tree, even though runtime pruning will allow all but one to be skipped

2. ExecCheckRTPerms(): checks permissions of *all* partitions before executing the plan tree, but maybe it's okay to check only the ones that will actually be accessed

Problem 1 has been discussed before, and David Rowley even developed a patch that was discussed at [2]. The approach taken in that patch was to delay locking the partitions contained in a generic plan that are potentially prunable at runtime. As also described in the linked thread, though, that approach has a race condition: a concurrent session may invalidate the generic plan by altering a partition in the window between when AcquireExecutorLocks() runs on the plan and when the plan is executed.

Another solution, suggested to me by Robert Haas in an off-list discussion, is to teach AcquireExecutorLocks() or the nearby code to perform EXTERN-parameter-based pruning before passing the plan tree to the executor, and to lock only the partitions that survive that pruning. It's perhaps doable if we refactor ExecFindInitialMatchingSubPlans() to not require a full-blown execution context; a hypothetical sketch of that flow is below.
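To illustrate the suggested direction only: here is what a pruning-aware AcquireExecutorLocks() might look like. Note that the real function in plancache.c is static and takes a List of statements, and ExecFindLockableRelations() does not exist today; it stands in for the refactored initial-pruning entry point:

    /*
     * Hypothetical rewrite: perform initial (EXTERN parameter based)
     * pruning first and lock only the relations that survive, rather
     * than locking everything in the flat range table.
     */
    static void
    AcquireExecutorLocks(PlannedStmt *plannedstmt, ParamListInfo params,
                         bool acquire)
    {
        Bitmapset  *lockrels;
        int         rti = -1;

        /*
         * Assumed new entry point: evaluates the plan's
         * PartitionPruneInfos against 'params' without requiring a
         * full-blown execution context, returning the RT indexes of
         * surviving relations plus all non-prunable ones.
         */
        lockrels = ExecFindLockableRelations(plannedstmt, params);

        while ((rti = bms_next_member(lockrels, rti)) >= 0)
        {
            RangeTblEntry *rte = rt_fetch(rti, plannedstmt->rtable);

            if (rte->rtekind != RTE_RELATION)
                continue;

            if (acquire)
                LockRelationOid(rte->relid, rte->rellockmode);
            else
                UnlockRelationOid(rte->relid, rte->rellockmode);
        }
    }

The tricky part, of course, is evaluating the pruning expressions without the executor state that ExecFindInitialMatchingSubPlans() currently relies on.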
Or maybe we could do something more invasive and rewrite AcquireExecutorLocks() to walk the plan tree instead of the flat range table, looking for scan nodes and nodes that support runtime pruning, and lock the appropriate relations.

Regarding problem 2, I wonder if we shouldn't simply move the permission check to ExecGetRangeTableRelation(), so that it is performed the first time a given relation is accessed during execution; see the sketch after this paragraph.
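Again, only a sketch under simplifying assumptions: ExecCheckRTEPerms() would need to be made available outside execMain.c, and details such as EXPLAIN-only execution and parallel workers (which re-open relations) are glossed over:

    Relation
    ExecGetRangeTableRelation(EState *estate, Index rti)
    {
        Relation    rel = estate->es_relations[rti - 1];

        if (rel == NULL)
        {
            /* First access during this execution. */
            RangeTblEntry *rte = exec_rt_fetch(rti, estate);

            /*
             * Proposed addition: check this relation's permissions here,
             * rather than ExecCheckRTPerms() checking the entire range
             * table upfront.  The error report mirrors what
             * ExecCheckRTPerms() raises today.
             */
            if (!ExecCheckRTEPerms(rte))
                aclcheck_error(ACLCHECK_NO_PRIV,
                               get_relkind_objtype(get_rel_relkind(rte->relid)),
                               get_rel_name(rte->relid));

            /* Simplified; the real function treats parallel workers
             * differently when opening and locking the relation. */
            rel = table_open(rte->relid, rte->rellockmode);
            estate->es_relations[rti - 1] = rel;
        }

        return rel;
    }

With this, a relation whose subplan is pruned never gets its permissions checked at all, which is exactly the open question stated above: whether skipping those checks is acceptable.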
Anyway, applying David's aforementioned patch gives the following numbers:

0       tps = 12325.890487 (excluding connections establishing)
10      tps = 12146.420443 (excluding connections establishing)
100     tps = 12807.850709 (excluding connections establishing)
1000    tps = 7578.652893 (excluding connections establishing)

which suggests that it might be worthwhile to find a solution for this.

Finally, there's one more place that shows up in perf profiles at higher partition counts, and that is LockReleaseAll(). David, Tsunakawa-san had worked on a patch [3], which still applies and can be shown to be quite beneficial when generic plans are involved. I couldn't get it to show a major improvement over the above numbers in this case (for UPDATE, that is), but maybe that's because the loop in ExecInitModifyTable() mentioned above is still in the way.

-- 
Amit Langote
EnterpriseDB: http://www.enterprisedb.com

[1] https://commitfest.postgresql.org/28/2575/
[2] https://www.postgresql.org/message-id/flat/CAKJS1f_kfRQ3ZpjQyHC7%3DPK9vrhxiHBQFZ%2Bhc0JCwwnRKkF3hg%40mail.gmail.com
[3] https://www.postgresql.org/message-id/flat/CAKJS1f-7T9F1xLw5PqgOApcV6YX3WYC4XJHHCpxh8hzcZsA-xA%40mail.gmail.com#c57f2f592484bca76310f4c551d4de15
Attachments:

v1-0001-Revise-how-some-FDW-executor-APIs-obtain-ResultRe.patch
v1-0002-Do-not-set-rootResultRelIndex-unnecessarily.patch
v1-0003-Delay-initializing-UPDATE-DELETE-ResultRelInfos.patch