Re: Trouble with hashagg spill I/O pattern and costing

Tomas Vondra Sun, 24 May 2020 19:11:48 -0700

On Thu, May 21, 2020 at 11:41:22PM +0200, Tomas Vondra wrote:

On Thu, May 21, 2020 at 02:16:37PM -0700, Jeff Davis wrote:

On Thu, 2020-05-21 at 21:13 +0200, Tomas Vondra wrote:

2) We could make it self-tuning, by increasing the number of blocks
we pre-allocate. So every time we exhaust the range, we double the
number of blocks (with a reasonable maximum, like 1024 or so). Or we
might just increment it by 32, or something.


Attached a new version that uses the doubling behavior, and cleans it
up a bit. It also returns the unused prealloc blocks back to lts-
freeBlocks when the tape is rewound for reading.


Ah, the returning is a nice idea, that should limit the overhead quite a
bit, I think.

IIUC the danger of pre-allocating blocks is that we might not fill
them,
resulting in temp file much larger than necessary. It might be
harmless
on some (most?) current filesystems that don't actually allocate
space
for blocks that are never written, but it also confuses our
accounting
of temporary file sizes. So we should try to limit that, and growing
the
number of pre-allocated blocks over time seems reasonable.


There's another danger here: it doesn't matter how well the filesystem
deals with sparse writes, because ltsWriteBlock fills in the holes with
zeros anyway. That's potentially a significant amount of wasted IO
effort if we aren't careful.


True. I'll give it a try on both machines and report some numbers. Might
take a couple of days.


OK, so I do have some numbers to share. I think there's a clear
conclusion that the two patches are a huge improvement, but there's also
something fishy about planning of parallel queries.

Firstly, I have two machines that I used for testing:

1) small one: i5-2500k (4 cores), 8GB RAM, SSD RAID for data, SSD for
temporary tablespace, using TPC-H 32GB data set

2) big one: 2x xeon e5-2620v3 (8 cores), 64GB RAM, NVME SSD for data,
temporary tablespace on SATA RAID0 (3 x 7.2k), using TPC-H 75GB


serial queries (no parallelism)
===============================

Results with parallel query disabled on the two machines look like this:

1) small one (SSD)

    algorithm  master  prealloc  tlist  prealloc-tlist
    --------------------------------------------------
         hash    1365       437    368             213
         sort     226       214    224             215

The sort row simply means "enable_hashagg = off" and AFAIK the patches
should not have a lot of influence here - the prealloc does, but it's
fairly negligible.

It's not always exactly on part, I've seen cases where hash or sort were
a bit faster (probably depending on work_mem), but I think we can ignore
that for now.


2) big one (SATA)

    algorithm  master  tlist  prealloc  prealloc+tlist
    --------------------------------------------------
         hash   25534   5120      2402             540
         sort     460    460       465             485

The effect is even more pronounced, thanks to poor handling of random
I/O by the SATA RAID device. It's not exactly on par with sort, but it's
close enough ...


parallel queries
================

And now the fun begins ...


1) small one (SSD, max_parallel_workers_per_gather = 2)

    algorithm  master  tlist  prealloc  prealloc+tlist
    --------------------------------------------------
         hash   693      390       177             128
         sort   103       99       101              99

This looks pretty nice - the patches have the expected effect, it got
faster than with just a single CPU etc.


2) big one (SATA, max_parallel_workers_per_gather = 16)

    algorithm  master  tlist  prealloc  prealloc+tlist
    --------------------------------------------------
         hash       ?  25000         ?            3132
         sort     248    234       216             200

Well, not that nice :-( The hash queries take so much time that I've
decided not to wait for them and the two numbers are actually just
estimates (after processing just a couple of logical tapes).

Plus it actually gets slower than with serial execution, so what's the
problem here? Especially considering it worked OK on the small machine?

At first I thought it's something about SSD vs. SATA, but it seems to be
more about how we construct the plans, because the plans between the two
machines are very different. And it seems to be depend by the number of
workers per gather - for low number of workers the plan looks like this
(the plans are attached in plans.txt in case the formatting gets broken
by your client):


                                                      QUERY PLAN
    
---------------------------------------------------------------------------------------------------------------
     Limit
       ->  Aggregate
             ->  Hash Join
                   Hash Cond: (part.p_partkey = lineitem_1.l_partkey)
                   Join Filter: (lineitem.l_quantity < ((0.2 * 
avg(lineitem_1.l_quantity))))
                   ->  Gather
                         Workers Planned: 2
                         ->  Nested Loop
                               ->  Parallel Seq Scan on part
                                     Filter: ((p_brand = 'Brand#22'::bpchar) 
AND (p_container = 'LG BOX'::bpchar))
                               ->  Index Scan using idx_lineitem_part_supp on 
lineitem
                                     Index Cond: (l_partkey = part.p_partkey)
                   ->  Hash
                         ->  Finalize HashAggregate
                               Group Key: lineitem_1.l_partkey
                               ->  Gather
                                     Workers Planned: 2
                                     ->  Partial HashAggregate
                                           Group Key: lineitem_1.l_partkey
                                           ->  Parallel Seq Scan on lineitem 
lineitem_1
    (20 rows)

but then if I crank the number of workers up, it switches to this:

                                                         QUERY PLAN
    
---------------------------------------------------------------------------------------------------------------------
     Limit
       ->  Finalize Aggregate
             ->  Gather
                   Workers Planned: 5
                   ->  Partial Aggregate
                         ->  Nested Loop
                               Join Filter: (part.p_partkey = 
lineitem.l_partkey)
                               ->  Hash Join
                                     Hash Cond: (part.p_partkey = 
lineitem_1.l_partkey)
                                     ->  Parallel Seq Scan on part
                                           Filter: ((p_brand = 
'Brand#22'::bpchar) AND (p_container = 'LG BOX'::bpchar))
                                     ->  Hash
                                           ->  HashAggregate
                                                 Group Key: lineitem_1.l_partkey
                                                 ->  Seq Scan on lineitem 
lineitem_1
                               ->  Index Scan using idx_lineitem_part_supp on 
lineitem
                                     Index Cond: (l_partkey = 
lineitem_1.l_partkey)
                                     Filter: (l_quantity < ((0.2 * 
avg(lineitem_1.l_quantity))))
    (18 rows)


Notice that in the first plan, the hashagg is on top of parallel-aware
path - so each workers builds hashagg only on a subset of data, and also
spills only a fraction of the input rows (so that all workers combined
spill rouhly the "whole" table).

In the second plan, the hashagg is on the non-partitioned side of the
join, so each workers builds a hash aggregate on the *whole* set of
input rows. Which means that (a) we need much more disk space for temp
files, making it unlikely to fit into page cache and (b) there's a lot
of contention for I/O, making it much more random.

Now, I haven't seen the second plan with sort-based aggregation, no
matter how I set the number of workers it always looks like this:

                                                         QUERY PLAN
    
---------------------------------------------------------------------------------------------------------------------
     Limit
       ->  Aggregate
             ->  Merge Join
                   Merge Cond: (lineitem_1.l_partkey = part.p_partkey)
                   Join Filter: (lineitem.l_quantity < ((0.2 * 
avg(lineitem_1.l_quantity))))
                   ->  Finalize GroupAggregate
                         Group Key: lineitem_1.l_partkey
                         ->  Gather Merge
                               Workers Planned: 8
                               ->  Partial GroupAggregate
                                     Group Key: lineitem_1.l_partkey
                                     ->  Sort
                                           Sort Key: lineitem_1.l_partkey
                                           ->  Parallel Seq Scan on lineitem 
lineitem_1
                   ->  Materialize
                         ->  Gather Merge
                               Workers Planned: 6
                               ->  Nested Loop
                                     ->  Parallel Index Scan using part_pkey on 
part
                                           Filter: ((p_brand = 
'Brand#22'::bpchar) AND (p_container = 'LG BOX'::bpchar))
                                     ->  Index Scan using 
idx_lineitem_part_supp on lineitem
                                           Index Cond: (l_partkey = 
part.p_partkey)
    (22 rows)

How come we don't have the same issue here? Is there something in the
optimizer that prevents us from creating the "silly" plans with
groupagg, and we should do the same thing for hashagg?


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

                                                  QUERY PLAN                    
                               
---------------------------------------------------------------------------------------------------------------
 Limit
   ->  Aggregate
         ->  Hash Join
               Hash Cond: (part.p_partkey = lineitem_1.l_partkey)
               Join Filter: (lineitem.l_quantity < ((0.2 * 
avg(lineitem_1.l_quantity))))
               ->  Gather
                     Workers Planned: 2
                     ->  Nested Loop
                           ->  Parallel Seq Scan on part
                                 Filter: ((p_brand = 'Brand#22'::bpchar) AND 
(p_container = 'LG BOX'::bpchar))
                           ->  Index Scan using idx_lineitem_part_supp on 
lineitem
                                 Index Cond: (l_partkey = part.p_partkey)
               ->  Hash
                     ->  Finalize HashAggregate
                           Group Key: lineitem_1.l_partkey
                           ->  Gather
                                 Workers Planned: 2
                                 ->  Partial HashAggregate
                                       Group Key: lineitem_1.l_partkey
                                       ->  Parallel Seq Scan on lineitem 
lineitem_1
(20 rows)


                                                     QUERY PLAN                 
                                     
---------------------------------------------------------------------------------------------------------------------
 Limit
   ->  Finalize Aggregate
         ->  Gather
               Workers Planned: 5
               ->  Partial Aggregate
                     ->  Nested Loop
                           Join Filter: (part.p_partkey = lineitem.l_partkey)
                           ->  Hash Join
                                 Hash Cond: (part.p_partkey = 
lineitem_1.l_partkey)
                                 ->  Parallel Seq Scan on part
                                       Filter: ((p_brand = 'Brand#22'::bpchar) 
AND (p_container = 'LG BOX'::bpchar))
                                 ->  Hash
                                       ->  HashAggregate
                                             Group Key: lineitem_1.l_partkey
                                             ->  Seq Scan on lineitem lineitem_1
                           ->  Index Scan using idx_lineitem_part_supp on 
lineitem
                                 Index Cond: (l_partkey = lineitem_1.l_partkey)
                                 Filter: (l_quantity < ((0.2 * 
avg(lineitem_1.l_quantity))))
(18 rows)


                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Limit
   ->  Aggregate
         ->  Merge Join
               Merge Cond: (lineitem_1.l_partkey = part.p_partkey)
               Join Filter: (lineitem.l_quantity < ((0.2 * 
avg(lineitem_1.l_quantity))))
               ->  Finalize GroupAggregate
                     Group Key: lineitem_1.l_partkey
                     ->  Gather Merge
                           Workers Planned: 8
                           ->  Partial GroupAggregate
                                 Group Key: lineitem_1.l_partkey
                                 ->  Sort
                                       Sort Key: lineitem_1.l_partkey
                                       ->  Parallel Seq Scan on lineitem 
lineitem_1
               ->  Materialize
                     ->  Gather Merge
                           Workers Planned: 6
                           ->  Nested Loop
                                 ->  Parallel Index Scan using part_pkey on part
                                       Filter: ((p_brand = 'Brand#22'::bpchar) 
AND (p_container = 'LG BOX'::bpchar))
                                 ->  Index Scan using idx_lineitem_part_supp on 
lineitem
                                       Index Cond: (l_partkey = part.p_partkey)
(22 rows)


                                                            QUERY PLAN          
                                                   
-----------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=9323493.08..9323493.10 rows=1 width=32)
   ->  Aggregate  (cost=9323493.08..9323493.10 rows=1 width=32)
         ->  Hash Join  (cost=8690894.07..9323348.15 rows=57973 width=8)
               Hash Cond: (part.p_partkey = lineitem_1.l_partkey)
               Join Filter: (lineitem.l_quantity < ((0.2 * 
avg(lineitem_1.l_quantity))))
               ->  Gather  (cost=1000.57..587448.17 rows=201706 width=21)
                     Workers Planned: 2
                     ->  Nested Loop  (cost=0.57..566277.57 rows=84044 width=21)
                           ->  Parallel Seq Scan on part  (cost=0.00..171069.18 
rows=2802 width=4)
                                 Filter: ((p_brand = 'Brand#22'::bpchar) AND 
(p_container = 'LG BOX'::bpchar))
                           ->  Index Scan using idx_lineitem_part_supp on 
lineitem  (cost=0.57..140.70 rows=35 width=17)
                                 Index Cond: (l_partkey = part.p_partkey)
               ->  Hash  (cost=8577801.28..8577801.28 rows=5518338 width=36)
                     ->  Finalize HashAggregate  (cost=8353618.79..8522617.90 
rows=5518338 width=36)
                           Group Key: lineitem_1.l_partkey
                           Planned Partitions: 16
                           ->  Gather  (cost=6362701.20..7925947.60 
rows=11036676 width=36)
                                 Workers Planned: 2
                                 ->  Partial HashAggregate  
(cost=6361701.20..6821280.00 rows=5518338 width=36)
                                       Group Key: lineitem_1.l_partkey
                                       Planned Partitions: 16
                                       ->  Parallel Seq Scan on lineitem 
lineitem_1  (cost=0.00..4399328.93 rows=79994793 width=9)
(22 rows)

                                                           QUERY PLAN           
                                                 
---------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=11757514.10..11757514.11 rows=1 width=32)
   ->  Finalize Aggregate  (cost=11757514.10..11757514.11 rows=1 width=32)
         ->  Gather  (cost=11757513.56..11757514.07 rows=5 width=32)
               Workers Planned: 5
               ->  Partial Aggregate  (cost=11756513.56..11756513.57 rows=1 
width=32)
                     ->  Nested Loop  (cost=11416439.73..11756484.57 rows=11595 
width=8)
                           Join Filter: (part.p_partkey = lineitem.l_partkey)
                           ->  Hash Join  (cost=11416439.16..11609836.77 
rows=1160 width=40)
                                 Hash Cond: (part.p_partkey = 
lineitem_1.l_partkey)
                                 ->  Parallel Seq Scan on part  
(cost=0.00..150269.09 rows=1345 width=4)
                                       Filter: ((p_brand = 'Brand#22'::bpchar) 
AND (p_container = 'LG BOX'::bpchar))
                                 ->  Hash  (cost=11304346.93..11304346.93 
rows=5518338 width=36)
                                       ->  HashAggregate  
(cost=10228949.50..11249163.55 rows=5518338 width=36)
                                             Group Key: lineitem_1.l_partkey
                                             Planned Partitions: 16
                                             ->  Seq Scan on lineitem 
lineitem_1  (cost=0.00..5519256.04 rows=191987504 width=9)
                           ->  Index Scan using idx_lineitem_part_supp on 
lineitem  (cost=0.57..126.27 rows=12 width=17)
                                 Index Cond: (l_partkey = lineitem_1.l_partkey)
                                 Filter: (l_quantity < ((0.2 * 
avg(lineitem_1.l_quantity))))
(19 rows)

                                                            QUERY PLAN          
                                                   
-----------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=13805808.43..13805808.44 rows=1 width=32)
   ->  Aggregate  (cost=13805808.43..13805808.44 rows=1 width=32)
         ->  Merge Join  (cost=7193277.61..13805663.49 rows=57973 width=8)
               Merge Cond: (lineitem_1.l_partkey = part.p_partkey)
               Join Filter: (lineitem.l_quantity < ((0.2 * 
avg(lineitem_1.l_quantity))))
               ->  Finalize GroupAggregate  (cost=7192276.51..13300598.75 
rows=5518338 width=36)
                     Group Key: lineitem_1.l_partkey
                     ->  Gather Merge  (cost=7192276.51..12886723.40 
rows=44146704 width=36)
                           Workers Planned: 8
                           ->  Partial GroupAggregate  
(cost=7191276.37..7440243.88 rows=5518338 width=36)
                                 Group Key: lineitem_1.l_partkey
                                 ->  Sort  (cost=7191276.37..7251272.46 
rows=23998438 width=9)
                                       Sort Key: lineitem_1.l_partkey
                                       ->  Parallel Seq Scan on lineitem 
lineitem_1  (cost=0.00..3839365.38 rows=23998438 width=9)
               ->  Materialize  (cost=1001.10..433407.27 rows=201706 width=21)
                     ->  Gather Merge  (cost=1001.10..432903.01 rows=201706 
width=21)
                           Workers Planned: 6
                           ->  Nested Loop  (cost=1.00..407388.21 rows=33618 
width=21)
                                 ->  Parallel Index Scan using part_pkey on 
part  (cost=0.43..249276.65 rows=1121 width=4)
                                       Filter: ((p_brand = 'Brand#22'::bpchar) 
AND (p_container = 'LG BOX'::bpchar))
                                 ->  Index Scan using idx_lineitem_part_supp on 
lineitem  (cost=0.57..140.70 rows=35 width=17)
                                       Index Cond: (l_partkey = part.p_partkey)
(22 rows)

Re: Trouble with hashagg spill I/O pattern and costing

Reply via email to