Executing generic plans involving partitions is known to become slower as the partition count grows, due to a number of bottlenecks, with AcquireExecutorLocks() showing up at the top of profiles.

A previous attempt at solving this problem was by David Rowley [1], who proposed delaying the locking of *all* partitions appearing under an Append/MergeAppend until "initial" pruning is done during the executor initialization phase. A problem with that approach, which he described in [2], is that leaving partitions unlocked can lead to race conditions: the Plan node belonging to a partition can be invalidated if a concurrent session successfully alters the partition between AcquireExecutorLocks() saying the plan is okay to execute and the plan actually being executed.

However, using an idea that Robert suggested to me off-list a little while back, it seems possible to determine the set of partitions that we can safely skip locking. The idea is to look at the "initial" or "pre-execution" pruning instructions contained in a given Append or MergeAppend node when AcquireExecutorLocks() is collecting the relations to lock, and to consider relations from only those sub-nodes that survive performing those instructions. I've attempted implementing that idea in the attached patch.

Note that "initial" pruning steps are now performed twice when executing generic plans: once in AcquireExecutorLocks() to find the partitions to be locked, and a second time in ExecInit[Merge]Append() to determine the set of partition sub-nodes to be initialized for execution. I wasn't able to come up with a good idea to avoid this duplication.
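
To make the idea of locking only the sub-nodes that survive "initial" pruning concrete, here is a minimal toy sketch in Python. It is not the patch and does not mirror PostgreSQL's actual data structures: Scan, Append, initial_prune, and relations_to_lock are hypothetical names invented for illustration, and the sketch ignores lock modes, the parent table, and plan invalidation entirely.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Union

# Hypothetical toy model; none of these names exist in PostgreSQL.
Params = Dict[str, int]


@dataclass
class Scan:
    relation: str  # name of the partition this leaf node scans


@dataclass
class Append:
    subplans: List["Plan"]
    # Stand-in for the "initial" pruning instructions: given the bound
    # parameter values, return the indexes of the subplans that survive.
    # None means the node carries no initial pruning steps.
    initial_prune: Optional[Callable[[Params], List[int]]] = None


Plan = Union[Scan, Append]


def relations_to_lock(plan: Plan, params: Params) -> List[str]:
    """Collect relations to lock, skipping sub-nodes pruned up front."""
    if isinstance(plan, Scan):
        return [plan.relation]
    if plan.initial_prune is None:
        # No pruning information: every sub-node must be considered.
        survivors = list(range(len(plan.subplans)))
    else:
        survivors = plan.initial_prune(params)
    relations: List[str] = []
    for index in survivors:
        relations.extend(relations_to_lock(plan.subplans[index], params))
    return relations


# Four partitions, with a pruning rule that keeps only partition aid % 4.
plan = Append(
    subplans=[Scan(f"p{i}") for i in range(4)],
    initial_prune=lambda params: [params["aid"] % 4],
)
print(relations_to_lock(plan, {"aid": 6}))  # ['p2']
```

In this toy model the lock-collection pass evaluates the pruning rule once; a later initialization pass would have to evaluate it again to decide which sub-nodes to set up, which is the same duplication noted above.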

Using the following benchmark setup:

pgbench testdb -i --partitions=$nparts > /dev/null 2>&1
pgbench -n testdb -S -T 30 -Mprepared

and plan_cache_mode = force_generic_plan, I get the following numbers:

HEAD:

32      tps = 20561.776403 (without initial connection time)
64      tps = 12553.131423 (without initial connection time)
128     tps = 13330.365696 (without initial connection time)
256     tps = 8605.723120 (without initial connection time)
512     tps = 4435.951139 (without initial connection time)
1024    tps = 2346.902973 (without initial connection time)
2048    tps = 1334.680971 (without initial connection time)

Patched:

32      tps = 27554.156077 (without initial connection time)
64      tps = 27531.161310 (without initial connection time)
128     tps = 27138.305677 (without initial connection time)
256     tps = 25825.467724 (without initial connection time)
512     tps = 19864.386305 (without initial connection time)
1024    tps = 18742.668944 (without initial connection time)
2048    tps = 16312.412704 (without initial connection time)

--
Amit Langote
EDB: http://www.enterprisedb.com

[1] https://www.postgresql.org/message-id/CAKJS1f_kfRQ3ZpjQyHC7=pk9vrhxihbqfz+hc0jcwwnrkkf...@mail.gmail.com
[2] https://www.postgresql.org/message-id/CAKJS1f99JNe%2Bsw5E3qWmS%2BHeLMFaAhehKO67J1Ym3pXv0XBsxw%40mail.gmail.com

Attachment: v1-0001-Teach-AcquireExecutorLocks-to-acquire-fewer-locks.patch