On Tue, Jul 22, 2025 at 2:13 PM Andrei Lepikhov <lepi...@gmail.com> wrote:
>
> On 4/6/2025 00:41, Alexander Korotkov wrote:
> > On Tue, Jun 3, 2025 at 5:35 PM Andrei Lepikhov <lepi...@gmail.com> wrote:
> >> For me, it seems like a continuation of the 7d8ac98 discussion. We may
> >> charge a small fee for MergeAppend to adjust the balance, of course.
> >> However, I think this small change requires a series of benchmarks to
> >> determine how it affects the overall cost balance. Without examples it
> >> is hard to say how important this issue is and whether it is worth
> >> commencing such work.
> >
> > Yes, I think it's fair to charge the MergeAppend node. We currently
> > cost it similarly to the merge stage of Sort, but it's clearly more
> > expensive: it deals with slots etc. at the executor level, while the
> > Sort node has a set of lower-level optimizations.
> After conducting additional research, I concluded that you are correct,
> and the current cost model doesn't allow the optimiser to detect the
> best option. A simple test with a full scan and sort of a partitioned
> table (see attachment) shows that the query plan prefers small sortings
> merged by the MergeAppend node. I have got the following results for
> different numbers of tuples to be sorted (in the range from 10 tuples to
> 1E8):
>
> EXPLAIN SELECT * FROM test ORDER BY y;
>
> 1E1: Sort (cost=9.53..9.57 rows=17 width=8)
> 1E2: Sort (cost=20.82..21.07 rows=100)
> 1E3: Merge Append (cost=56.19..83.69 rows=1000)
> 1E4: Merge Append (cost=612.74..887.74 rows=10000)
> 1E5: Merge Append (cost=7754.19..10504.19 rows=100000)
> 1E6: Merge Append (cost=94092.25..121592.25 rows=1000000)
> 1E7: Merge Append (cost=1106931.22..1381931.22 rows=10000000)
> 1E8: Merge Append (cost=14097413.40..16847413.40 rows=100000000)
>
> That happens because both total costs lie within the fuzzy-factor gap,
> and the optimiser chooses the path based on the startup cost, which is
> obviously better for the MergeAppend case.
>
> At the same time, execution involving a single Sort node dominates the
> MergeAppend case:
>
> 1E3: MergeAppend: 1.927 ms, Sort: 0.720 ms
> 1E4: MergeAppend: 10.090 ms, Sort: 7.583 ms
> 1E5: MergeAppend: 118.885 ms, Sort: 88.492 ms
> 1E6: MergeAppend: 1372.717 ms, Sort: 1106.184 ms
> 1E7: MergeAppend: 15103.893 ms, Sort: 13415.806 ms
> 1E8: MergeAppend: 176553.133 ms, Sort: 149458.787 ms
>
> Looking inside the code, I found the only difference we can employ to
> rationalise the cost model change: the re-structuring of a heap. The
> siftDown routine employs two tuple comparisons to find the proper
> position for a tuple. So we have grounds for changing the constant in
> the cost model of the merge operation:
>
> @@ -2448,7 +2448,7 @@ cost_merge_append(Path *path, PlannerInfo *root,
>      logN = LOG2(N);
>
>      /* Assumed cost per tuple comparison */
> -    comparison_cost = 2.0 * cpu_operator_cost;
> +    comparison_cost = 4.0 * cpu_operator_cost;
>
>      /* Heap creation cost */
>      startup_cost += comparison_cost * N * logN;
>
> The exact change also needs to be made in the cost_gather_merge
> function, of course.
> At this moment, I'm not sure that it should be multiplied by 2 - that is
> a subject for further discussion. However, it fixes the current issue
> and adds a little additional cost to the merge operation, which, as I
> see it, is quite limited.
> Please see the new version of the patch attached.
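For reference, the matching spot in cost_gather_merge() carries the same
assumption; from memory it looks roughly like the following, so the doubled
constant would feed both the heap-creation and the per-tuple maintenance
terms there as well:

    /* Assumed cost per tuple comparison */
    comparison_cost = 2.0 * cpu_operator_cost;

    /* Heap creation cost */
    startup_cost += comparison_cost * N * logN;

    /* Per-tuple heap maintenance cost */
    run_cost += path->path.rows * comparison_cost * logN;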
I've checked the cost adjustment you've made. If you change the cost of
top-N sort, you must also change the preceding "if" condition that decides
when it gets selected. We currently do the switch at tuples = 2 *
limit_tuples; if we apply the proposed change, we should switch at
tuples = limit_tuples ^ 2. But also, in order for this cost to reflect
reality, we must change tuplesort_puttuple_common() in the same way.
These inconsistencies lead to failures in the contrib/postgres_fdw checks.
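To spell out the arithmetic behind that switch point (a rough sketch based
on the cost_tuplesort() formulas, with N = input tuples, k = limit_tuples,
and the default comparison cost assumed):

    quicksort of all N tuples:      2.0 * cpu_operator_cost * N * log2(N)
    top-N heapsort, with the
    proposed doubled constant:      4.0 * cpu_operator_cost * N * log2(2 * k)

    break-even:  2 * log2(2 * k) = log2(N)   =>   N = (2 * k)^2

So the crossover moves from about 2 * limit_tuples to the order of
limit_tuples ^ 2, and the "if" condition and the executor-side switch would
have to move together with it to stay consistent.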
But these changes don't appear to be a win. See the examples below: top-N
sort appears to be a win for LIMITs up to 500000.

# EXPLAIN ANALYZE SELECT * FROM (SELECT random() r FROM generate_series(1, 1000000) i) x ORDER BY x.r LIMIT 10000;
                                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=112157.85..112182.85 rows=10000 width=8) (actual time=638.998..639.841 rows=10000.00 loops=1)
   ->  Sort  (cost=112157.85..114657.85 rows=1000000 width=8) (actual time=638.996..639.340 rows=10000.00 loops=1)
         Sort Key: (random())
         Sort Method: quicksort  Memory: 24577kB
         ->  Function Scan on generate_series i  (cost=0.00..12500.00 rows=1000000 width=8) (actual time=126.582..205.610 rows=1000000.00 loops=1)
 Planning Time: 0.118 ms
 Execution Time: 653.283 ms
(7 rows)

# EXPLAIN ANALYZE SELECT * FROM (SELECT random() r FROM generate_series(1, 1000000) i) x ORDER BY x.r LIMIT 500000;
                                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=112157.85..113407.85 rows=500000 width=8) (actual time=646.522..688.573 rows=500000.00 loops=1)
   ->  Sort  (cost=112157.85..114657.85 rows=1000000 width=8) (actual time=646.520..663.562 rows=500000.00 loops=1)
         Sort Key: (random())
         Sort Method: quicksort  Memory: 24577kB
         ->  Function Scan on generate_series i  (cost=0.00..12500.00 rows=1000000 width=8) (actual time=129.028..208.936 rows=1000000.00 loops=1)
 Planning Time: 0.188 ms
 Execution Time: 713.738 ms
(7 rows)

# EXPLAIN ANALYZE SELECT * FROM (SELECT random() r FROM generate_series(1, 1000000) i) x ORDER BY x.r LIMIT 10000;
                                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=78938.56..78963.56 rows=10000 width=8) (actual time=412.633..413.459 rows=10000.00 loops=1)
   Buffers: shared hit=3, temp read=1709 written=1709
   ->  Sort  (cost=78938.56..81438.56 rows=1000000 width=8) (actual time=412.631..412.969 rows=10000.00 loops=1)
         Sort Key: (random())
         Sort Method: top-N heapsort  Memory: 769kB
         Buffers: shared hit=3, temp read=1709 written=1709
         ->  Function Scan on generate_series i  (cost=0.00..12500.00 rows=1000000 width=8) (actual time=185.892..333.233 rows=1000000.00 loops=1)
               Buffers: temp read=1709 written=1709
 Planning:
   Buffers: shared hit=7 read=8
 Planning Time: 2.058 ms
 Execution Time: 416.040 ms
(12 rows)

# EXPLAIN ANALYZE SELECT * FROM (SELECT random() r FROM generate_series(1, 1000000) i) x ORDER BY x.r LIMIT 500000;
                                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=112157.85..113407.85 rows=500000 width=8) (actual time=631.185..673.768 rows=500000.00 loops=1)
   ->  Sort  (cost=112157.85..114657.85 rows=1000000 width=8) (actual time=631.183..649.072 rows=500000.00 loops=1)
         Sort Key: (random())
         Sort Method: quicksort  Memory: 24577kB
         ->  Function Scan on generate_series i  (cost=0.00..12500.00 rows=1000000 width=8) (actual time=121.274..200.453 rows=1000000.00 loops=1)
 Planning Time: 0.243 ms
 Execution Time: 698.841 ms
(7 rows)

I have another idea. cost_tuplesort() puts 2.0 under the logarithm to
prefer tuplesort over heapsort. I think we can adjust cost_gather_merge()
and cost_merge_append() to do the same. The 0001 patch implements that.
I think the plan changes of 0001 might be reasonable, since most cases
deal with small rowsets. One thing concerns me: 0002 still affects one of
the postgres_fdw checks. Could you please take a look?

------
Regards,
Alexander Korotkov
Supabase
v8-0001-Add-some-penatly-to-gather-merge-and-merge-append.patch
Description: Binary data
v8-0002-Consider-an-explicit-sort-of-the-MergeAppend-subp.patch
Description: Binary data