On Thu, May 21, 2020 at 03:41:22PM +0200, Tomas Vondra wrote:
On Tue, May 19, 2020 at 05:12:02PM +0200, Tomas Vondra wrote:

...

The problem is that the hashagg plan runs in ~1400 seconds, while the
groupagg only takes ~360. And per explain analyze, the difference really
is in the aggregation - if we subtract the seqscan, the sort+groupagg
takes about 310s:

  ->  GroupAggregate  (cost=41772791.17..43305665.51 rows=6206695 width=36) (actual time=283378.004..335611.192 rows=6398981 loops=1)
        Group Key: lineitem_1.l_partkey
        ->  Sort  (cost=41772791.17..42252715.81 rows=191969856 width=9) (actual time=283377.977..306182.393 rows=191969841 loops=1)
              Sort Key: lineitem_1.l_partkey
              Sort Method: external merge  Disk: 3569544kB
              ->  Seq Scan on lineitem lineitem_1  (cost=0.00..5519079.56 rows=191969856 width=9) (actual time=0.019..28253.076 rows=192000551 loops=1)

while the hashagg takes ~1330s:

  ->  HashAggregate  (cost=13977751.34..15945557.39 rows=6206695 width=36) (actual time=202952.170..1354546.897 rows=6400000 loops=1)
        Group Key: lineitem_1.l_partkey
        Planned Partitions: 128
        Peak Memory Usage: 4249 kB
        Disk Usage: 26321840 kB
        HashAgg Batches: 16512
        ->  Seq Scan on lineitem lineitem_1  (cost=0.00..5519079.56 rows=191969856 width=9) (actual time=0.007..22205.617 rows=192000551 loops=1)

And that's while only writing 26GB, compared to 35GB in the sorted plan,
and with cost being ~16M vs. ~43M (so roughly inverse).


I've noticed I've actually made a mistake here - it's not 26GB vs. 35GB
in hash vs. sort, it's 26GB vs. 3.5GB. That is, the sort-based plan
writes out *way less* data to the temp file.

The reason is revealed by explain verbose:

 ->  GroupAggregate
       Output: lineitem_1.l_partkey, (0.2 * avg(lineitem_1.l_quantity))
       Group Key: lineitem_1.l_partkey
       ->  Sort
             Output: lineitem_1.l_partkey, lineitem_1.l_quantity
             Sort Key: lineitem_1.l_partkey
             ->  Seq Scan on public.lineitem lineitem_1
                   Output: lineitem_1.l_partkey, lineitem_1.l_quantity

 ->  HashAggregate
       Output: lineitem_1.l_partkey, (0.2 * avg(lineitem_1.l_quantity))
       Group Key: lineitem_1.l_partkey
       ->  Seq Scan on public.lineitem lineitem_1
             Output: lineitem_1.l_orderkey, lineitem_1.l_partkey,
                     lineitem_1.l_suppkey, lineitem_1.l_linenumber,
                     lineitem_1.l_quantity, lineitem_1.l_extendedprice,
                     lineitem_1.l_discount, lineitem_1.l_tax,
                     lineitem_1.l_returnflag, lineitem_1.l_linestatus,
                     lineitem_1.l_shipdate, lineitem_1.l_commitdate,
                     lineitem_1.l_receiptdate, lineitem_1.l_shipinstruct,
                     lineitem_1.l_shipmode, lineitem_1.l_comment

It seems that in the hashagg case we're not applying projection in the
seqscan, which forces us to serialize way more data than necessary
(essentially the whole lineitem table).

It's probably still worth tweaking the I/O pattern, though.
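
For context, these are the tlist flags that create_plan_recurse() accepts,
roughly as defined near the top of createplan.c (paraphrased from memory, so
the exact comment wording may differ):

   #define CP_EXACT_TLIST   0x0001   /* child must return exactly the path's tlist */
   #define CP_SMALL_TLIST   0x0002   /* prefer a narrow tlist, i.e. only the Vars
                                      * the parent node actually needs */
   #define CP_LABEL_TLIST   0x0004   /* tlist must carry ressortgroupref labels,
                                      * but a full "physical" tlist is still OK */

With CP_LABEL_TLIST the seqscan is free to return the whole physical tuple
(which is exactly what the verbose output above shows), while CP_SMALL_TLIST
asks it to project down to just the grouping column and the aggregate input
before the tuples ever hit the spill files.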


OK, it seems the attached trivial fix (simply changing CP_LABEL_TLIST to
CP_SMALL_TLIST) addresses this for me. I've only tried it on the patched
version that pre-allocates 128 blocks, and the results (total query times in
seconds, for work_mem of 4MB and 128MB) seem pretty nice:

               sort      hash      hash+tlist
   ------------------------------------------
      4MB       331       478            188
    128MB       222       434            210

which I guess is what we wanted ...

I'll give it a try on the other machine (temp on SATA), but I don't see
why it would not behave just as nicely.
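
For comparison, create_sort_plan() already asks its child for a narrow tlist,
which is why the seqscan under the Sort/GroupAggregate plan only emitted
l_partkey and l_quantity - the patch simply makes create_agg_plan() behave the
same way. Roughly (excerpt from memory, the exact comment wording in the tree
may differ):

   /*
    * In create_sort_plan(): we don't want any excess columns in the sorted
    * tuples, so request a smaller tlist from the child.
    */
   subplan = create_plan_recurse(root, best_path->subpath,
                                 flags | CP_SMALL_TLIST);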

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 9941dfe65e..08d43c270e 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -2118,7 +2118,7 @@ create_agg_plan(PlannerInfo *root, AggPath *best_path)
         * Agg can project, so no need to be terribly picky about child tlist, but
         * we do need grouping columns to be available
         */
-       subplan = create_plan_recurse(root, best_path->subpath, CP_LABEL_TLIST);
+       subplan = create_plan_recurse(root, best_path->subpath, CP_SMALL_TLIST);
 
        tlist = build_path_tlist(root, &best_path->path);
 
