On Tue, Jul 22, 2025 at 2:13 PM Andrei Lepikhov <lepi...@gmail.com> wrote:
>
> On 4/6/2025 00:41, Alexander Korotkov wrote:
> > On Tue, Jun 3, 2025 at 5:35 PM Andrei Lepikhov <lepi...@gmail.com> wrote:
> >> For me, it seems like a continuation of the 7d8ac98 discussion. We may
> >> charge a small fee for MergeAppend to adjust the balance, of course.
> >> However, I think this small change requires a series of benchmarks to
> >> determine how it affects the overall cost balance. Without examples it
> >> is hard to say how important this issue is and whether it is worth
> >> commencing such work.
> >
> > Yes, I think it's fair to charge the MergeAppend node. We currently
> > cost it similarly to the merge stage of Sort, but it's clearly more
> > expensive: it deals with slots etc. at the executor level, while the
> > Sort node has a set of lower-level optimizations.
> After conducting additional research, I concluded that you are correct,
> and the current cost model doesn't allow the optimiser to detect the
> best option. A simple test with a full scan and sort of a partitioned
> table (see attachment) shows that the query plan prefers small sortings
> merged by the MergeAppend node. I have got the following results for
> different numbers of tuples to be sorted (in the range from 10 tuples to
> 1E8):
>
> EXPLAIN SELECT * FROM test ORDER BY y;
>
> 1E1: Sort (cost=9.53..9.57 rows=17 width=8)
> 1E2: Sort (cost=20.82..21.07 rows=100)
> 1E3: Merge Append (cost=56.19..83.69 rows=1000)
> 1E4: Merge Append (cost=612.74..887.74 rows=10000)
> 1E5: Merge Append (cost=7754.19..10504.19 rows=100000)
> 1E6: Merge Append (cost=94092.25..121592.25 rows=1000000)
> 1E7: Merge Append (cost=1106931.22..1381931.22 rows=10000000)
> 1E8: Merge Append (cost=14097413.40..16847413.40 rows=100000000)
>
> That happens because both total costs lie within the fuzzy-factor gap,
> and the optimiser chooses the path based on the startup cost, which is
> obviously better for the MergeAppend case.
>
> At the same time, execution involving a single Sort node dominates the
> MergeAppend case:
>
> 1E3: MergeAppend: 1.927 ms, Sort: 0.720 ms
> 1E4: MergeAppend: 10.090 ms, Sort: 7.583 ms
> 1E5: MergeAppend: 118.885 ms, Sort: 88.492 ms
> 1E6: MergeAppend: 1372.717 ms, Sort: 1106.184 ms
> 1E7: MergeAppend: 15103.893 ms, Sort: 13415.806 ms
> 1E8: MergeAppend: 176553.133 ms, Sort: 149458.787 ms
>
> Looking inside the code, I found the only difference we can employ to
> rationalise the cost model change: the re-structuring of a heap. The
> siftDown routine employs two tuple comparisons to find the proper
> position for a tuple. So we have grounds for changing the constant in
> the cost model of the merge operation:
>
> @@ -2448,7 +2448,7 @@ cost_merge_append(Path *path, PlannerInfo *root,
>      logN = LOG2(N);
>
>      /* Assumed cost per tuple comparison */
> -    comparison_cost = 2.0 * cpu_operator_cost;
> +    comparison_cost = 4.0 * cpu_operator_cost;
>
>      /* Heap creation cost */
>      startup_cost += comparison_cost * N * logN;
>
> The exact change also needs to be made in the cost_gather_merge
> function, of course.
> At this moment, I'm not sure that it should be multiplied by 2 - that is
> a subject for further discussion. However, it fixes the current issue
> and adds a little additional cost to the merge operation, which, as I
> see it, is quite limited.
> Please see the new version of the patch attached.
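For reference, the matching spot in cost_gather_merge() carries the same
assumption; from memory it looks roughly like the following, so the doubled
constant would feed both the heap-creation and the per-tuple maintenance
terms there as well:

    /* Assumed cost per tuple comparison */
    comparison_cost = 2.0 * cpu_operator_cost;

    /* Heap creation cost */
    startup_cost += comparison_cost * N * logN;

    /* Per-tuple heap maintenance cost */
    run_cost += path->path.rows * comparison_cost * logN;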
I've checked the cost adjustment you've made. If you change the cost of
top-N sort, you must also change the preceding "if" condition that decides
when it gets selected. We currently do the switch at tuples = 2 *
limit_tuples; if we apply the proposed change, we should switch at
tuples = limit_tuples ^ 2. But also, in order for this cost to reflect
reality, we must change tuplesort_puttuple_common() in the same way.
These inconsistencies lead to failures in the contrib/postgres_fdw checks.
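To spell out the arithmetic behind that switch point (a rough sketch based
on the cost_tuplesort() formulas, with N = input tuples, k = limit_tuples,
and the default comparison cost assumed):

    quicksort of all N tuples:      2.0 * cpu_operator_cost * N * log2(N)
    top-N heapsort, with the
    proposed doubled constant:      4.0 * cpu_operator_cost * N * log2(2 * k)

    break-even:  2 * log2(2 * k) = log2(N)   =>   N = (2 * k)^2

So the crossover moves from about 2 * limit_tuples to the order of
limit_tuples ^ 2, and the "if" condition and the executor-side switch would
have to move together with it to stay consistent.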
But these changes don't appear to be a win. See the examples below: top-N
sort appears to be a win for LIMITs up to 500000.

# EXPLAIN ANALYZE SELECT * FROM (SELECT random() r FROM generate_series(1, 1000000) i) x ORDER BY x.r LIMIT 10000;
                                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=112157.85..112182.85 rows=10000 width=8) (actual time=638.998..639.841 rows=10000.00 loops=1)
   ->  Sort  (cost=112157.85..114657.85 rows=1000000 width=8) (actual time=638.996..639.340 rows=10000.00 loops=1)
         Sort Key: (random())
         Sort Method: quicksort  Memory: 24577kB
         ->  Function Scan on generate_series i  (cost=0.00..12500.00 rows=1000000 width=8) (actual time=126.582..205.610 rows=1000000.00 loops=1)
 Planning Time: 0.118 ms
 Execution Time: 653.283 ms
(7 rows)

# EXPLAIN ANALYZE SELECT * FROM (SELECT random() r FROM generate_series(1, 1000000) i) x ORDER BY x.r LIMIT 500000;
                                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=112157.85..113407.85 rows=500000 width=8) (actual time=646.522..688.573 rows=500000.00 loops=1)
   ->  Sort  (cost=112157.85..114657.85 rows=1000000 width=8) (actual time=646.520..663.562 rows=500000.00 loops=1)
         Sort Key: (random())
         Sort Method: quicksort  Memory: 24577kB
         ->  Function Scan on generate_series i  (cost=0.00..12500.00 rows=1000000 width=8) (actual time=129.028..208.936 rows=1000000.00 loops=1)
 Planning Time: 0.188 ms
 Execution Time: 713.738 ms
(7 rows)

# EXPLAIN ANALYZE SELECT * FROM (SELECT random() r FROM generate_series(1, 1000000) i) x ORDER BY x.r LIMIT 10000;
                                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=78938.56..78963.56 rows=10000 width=8) (actual time=412.633..413.459 rows=10000.00 loops=1)
   Buffers: shared hit=3, temp read=1709 written=1709
   ->  Sort  (cost=78938.56..81438.56 rows=1000000 width=8) (actual time=412.631..412.969 rows=10000.00 loops=1)
         Sort Key: (random())
         Sort Method: top-N heapsort  Memory: 769kB
         Buffers: shared hit=3, temp read=1709 written=1709
         ->  Function Scan on generate_series i  (cost=0.00..12500.00 rows=1000000 width=8) (actual time=185.892..333.233 rows=1000000.00 loops=1)
               Buffers: temp read=1709 written=1709
 Planning:
   Buffers: shared hit=7 read=8
 Planning Time: 2.058 ms
 Execution Time: 416.040 ms
(12 rows)

# EXPLAIN ANALYZE SELECT * FROM (SELECT random() r FROM generate_series(1, 1000000) i) x ORDER BY x.r LIMIT 500000;
                                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=112157.85..113407.85 rows=500000 width=8) (actual time=631.185..673.768 rows=500000.00 loops=1)
   ->  Sort  (cost=112157.85..114657.85 rows=1000000 width=8) (actual time=631.183..649.072 rows=500000.00 loops=1)
         Sort Key: (random())
         Sort Method: quicksort  Memory: 24577kB
         ->  Function Scan on generate_series i  (cost=0.00..12500.00 rows=1000000 width=8) (actual time=121.274..200.453 rows=1000000.00 loops=1)
 Planning Time: 0.243 ms
 Execution Time: 698.841 ms
(7 rows)

I have another idea. cost_tuplesort() puts 2.0 under the logarithm to
prefer tuplesort over heapsort. I think we can adjust cost_gather_merge()
and cost_merge_append() to do the same. The 0001 patch implements that.
I think the plan changes of 0001 might be reasonable, since most cases
deal with small rowsets. One thing concerns me: 0002 still affects one of
the postgres_fdw checks. Could you please take a look?

------
Regards,
Alexander Korotkov
Supabase
v8-0001-Add-some-penatly-to-gather-merge-and-merge-append.patch
Description: Binary data
v8-0002-Consider-an-explicit-sort-of-the-MergeAppend-subp.patch
Description: Binary data