On Sun, Aug 30, 2020 at 02:26:20AM +0200, Tomas Vondra wrote:
On Fri, Aug 28, 2020 at 06:32:38PM -0700, Jeff Davis wrote:
On Thu, 2020-08-27 at 17:28 -0700, Peter Geoghegan wrote:
We have a Postgres 13 open item for Disk-based hash aggregate, which
is the only non-trivial open item. There is a general concern about
how often we get disk-based hash aggregation when work_mem is
particularly low, and recursion seems unavoidable. This is generally
thought to be a problem in the costing.

We discussed two approaches to tweaking the cost model:

1. Penalize HashAgg disk costs by a constant amount. HashAgg seems to
be chosen a little too often, and this would reduce the number of plan
changes (see the sketch below).

2. Treat recursive HashAgg spilling skeptically, and heavily penalize
recursive spilling.

The problem with approach #2 is that we have a default hash mem of 4MB,
and real systems have a lot more than that. In this scenario, recursive
spilling can beat Sort by a lot.
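
To make #1 a bit more concrete: the penalty would presumably be applied
to the estimated spill I/O in cost_agg, i.e. something like this
(purely an illustration - the factor, or whether it should be a flat
amount instead, is exactly what would need to be discussed):

    /* illustration only: penalize HashAgg spill I/O by a constant factor */
    pages_written *= spill_penalty;    /* e.g. spill_penalty = 2.0 */
    pages_read *= spill_penalty;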


I think the main issue is that we're mostly speculating about what's
wrong. I've shared some measurements and symptoms, and we've discussed
what might be causing them, but I don't think we know for sure.

I really dislike (1) because it seems like a "we don't know what's
wrong, so we'll penalize hashagg" kind of solution. A much more
principled solution would be to tweak the costing accordingly, not
just add some constant. For (2) it really depends on whether recursive
spilling actually is the problem here. In the examples I shared, the
number of partitions/batches was very different, but the query
duration was mostly independent of that (almost constant).


I've decided to look at the costing a bit more closely today, to see
why the hashagg costing is so much lower compared to sort/groupagg.
I've used the same 32GB dataset and query as in [1].

I've repeated the tests for all the work_mem values, and the number
of batches is much lower, probably thanks to the HLL improvement:

      2MB    Planned:  64    Batches (old):  4160    Batches (new):  2977
      4MB    Planned: 128    Batches (old): 16512    Batches (new):  1569
      8MB    Planned: 256    Batches (old): 21488    Batches (new):  1289
     64MB    Planned:  32    Batches (old):  2720    Batches (new):   165
    256MB    Planned:   8    Batches (old):     8    Batches (new):    41

The impact on duration of the queries seems pretty negligible, though.


The plans with work_mem=64MB look like this:

1) hashagg

                                                               QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=11267293.86..11267293.86 rows=1 width=36) (actual time=213127.515..213127.517 rows=0 loops=1)
   ->  HashAggregate  (cost=10229061.10..11267293.86 rows=6718533 width=36) (actual time=100862.623..212948.642 rows=6400000 loops=1)
         Group Key: lineitem.l_partkey
         Planned Partitions: 32  Batches: 165  Memory Usage: 67857kB  Disk Usage: 6802432kB
         ->  Seq Scan on lineitem  (cost=0.00..5519288.36 rows=191990736 width=9) (actual time=0.418..20344.631 rows=192000551 loops=1)
 Planning Time: 0.064 ms
 Execution Time: 213441.986 ms
(7 rows)

2) groupagg

                                                                  QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=36755617.81..36755617.81 rows=1 width=36) (actual time=180029.594..180029.595 rows=0 loops=1)
   ->  GroupAggregate  (cost=35214909.30..36755617.81 rows=6718533 width=36) (actual time=94017.788..179857.683 rows=6400000 loops=1)
         Group Key: lineitem.l_partkey
         ->  Sort  (cost=35214909.30..35694886.14 rows=191990736 width=9) (actual time=94017.750..151138.727 rows=192000551 loops=1)
               Sort Key: lineitem.l_partkey
               Sort Method: external merge  Disk: 3742208kB
               ->  Seq Scan on lineitem  (cost=0.00..5519288.36 rows=191990736 width=9) (actual time=0.008..26831.264 rows=192000551 loops=1)
 Planning Time: 0.063 ms
 Execution Time: 180209.580 ms
(9 rows)


I still don't understand why the hashagg is costed so much lower than
the sort (only about 1/3 of it), given that both cases use exactly the
same estimates internally. cost_tuplesort ends up with

    npages = 937455
    nruns  = 114.435396
    input_bytes = 7679629440
    log_runs = 1.0

while cost_agg uses

    pages_read = 937455
    pages_written = 937455
    relation_size = 7679629440

yet we end up with a much lower estimate for hashagg. It does seem to
me, however, that this is mostly due to non-I/O costs, which
cost_tuplesort considers and cost_agg perhaps ignores. In particular,
most of the sort cost comes from this:

    *startup_cost = comparison_cost * tuples * LOG2(tuples);
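
Just to put some numbers on that (back-of-the-envelope, using the
default cpu_operator_cost = 0.0025, i.e. comparison_cost = 2 * 0.0025
= 0.005, and the ~192M input rows from the plans above):

    comparison_cost * tuples * LOG2(tuples)
      = 0.005 * 191990736 * ~27.5
     ~= 26.4M

which is most of the ~29.7M the Sort node adds on top of the seqscan
(35.2M - 5.5M). The spill I/O charged to the hashagg (937455 pages
written + 937455 pages read) only amounts to a few million even at
random_page_cost, so a missing CPU term of roughly this size would
explain most of the difference.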

So I'm wondering whether the hashagg costing is ignoring similar
non-I/O costs for the spilling case. In particular, the initial
section computing startup_cost seems to ignore that we may need to do
some of the work repeatedly - for example, we'll repeat hash lookups
for spilled tuples, and so on.
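
Roughly what I have in mind is an extra CPU term along these lines
(a sketch only, not actual cost_agg code - the names are meant to
mirror what cost_agg already computes for the I/O estimate, and the
2.0 factor is just a guess):

    /* sketch: charge CPU for re-processing spilled tuples */
    if (nbatches > 1)
    {
        /* levels of recursive repartitioning, as in the I/O estimate */
        int     depth = ceil(log(nbatches) / log(num_partitions));
        double  spill_cpu_cost;

        /*
         * At each level the spilled tuples get written out, read back
         * and looked up in the hash table again, so charge a couple of
         * operator costs per input tuple and level. (In reality only
         * the spilled fraction is re-processed, so this overcharges a
         * bit.)
         */
        spill_cpu_cost = depth * input_tuples * 2.0 * cpu_operator_cost;

        startup_cost += spill_cpu_cost;
        total_cost += spill_cpu_cost;
    }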

The other thing is that the sort seems to be doing only about half
the physical I/O (as measured by iosnoop) compared to the hashagg,
even though the estimated pages / input_bytes are exactly the same.
For the hashagg, iosnoop shows 5921MB of reads and 7185MB of writes,
while the sort only does 2895MB of reads and 3655MB of writes. That
roughly matches the observed sizes of the temp files in the two cases,
so input_bytes for the sort seems to be a bit overestimated.


regards

[1] https://www.postgresql.org/message-id/20200724012248.y77rpqc73agrsvb3@development


--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

