On Sun, 2020-08-30 at 17:03 +0200, Tomas Vondra wrote:
> So I'm wondering if the hashagg is not ignoring similar non-I/O costs
> for the spilling case. In particular, the initial section computing
> startup_cost seems to ignore that we may need to do some of the stuff
> repeatedly - for example we'll repeat hash lookups for spilled tuples,
> and so on.
I tried to analyze this using a slightly different approach: cost units
per second of runtime. Obviously this will vary based on the machine,
but it's interesting anyway. All of the following fit in system memory.
Schema, data, and queries are at the end of this email, along with a
sketch of how the cost-units/second numbers were taken.

A low value of cost-units/second-of-runtime means "more likely to be
chosen incorrectly" (the plan's cost underestimates its actual
runtime), and a high value means "more likely to be missed
incorrectly".

 Plan    | work_mem | 10M  | 100M INT4    | 100M | 10M
         |          | INT4 | (10M groups) | INT4 | TEXT
---------+----------+------+--------------+------+-----
 HashAgg | 4MB      |   88 |           63 |   82 |  78
 HashAgg | 1TB      |   41 |           37 |   33 |  38
 Sort    | 4MB      |  182 |          188 |  174 |  37
 Sort    | 1TB      |  184 |          231 |  189 |  30

Sort is consistent for a given data type, but it seems that the
comparison cost is far from proportionate between data types. HashAgg
is consistent between the data types, but the disk costs play a larger
role (in this case there is no actual disk access to worry about,
because everything fits in system memory).

At least for these simple queries, Sort is punished unfairly for INT4,
but gets an unfair boost for TEXT.

It seems that we should make a change here, but to be conservative for
13 we should limit changes to the new plans, which are the first line
(HashAgg that spills).

The attached patch makes two adjustments: a 2X disk penalty for
HashAgg, and a new CPU cost for spilling:

  spill_cost = depth * input_tuples * 2.0 * cpu_tuple_cost

The new results:

 Plan    | work_mem | 10M  | 100M INT4    | 100M | 10M
         |          | INT4 | (10M groups) | INT4 | TEXT
---------+----------+------+--------------+------+-----
 HashAgg | 4MB      |  192 |          131 |  178 | 166

That's much more comparable to Sort for INT4, but it makes the gap
wider for TEXT. Fortunately, at least for my simple queries, it just
barely avoids a plan change to Sort for the TEXT case (where the Sort
plan is nearly 5X slower than HashAgg).

I think this approach is reasonable for 13: it only changes the costing
for HashAgg plans that are expected to spill, it's fairly conservative
so we will not get lots of new bad plans, and it still seems to use
HashAgg for cases where it clearly should.

Note: in-memory HashAgg is still unfairly favored over Sort, at least
for these cases.

Regards,
	Jeff Davis


create table t10m(i int);
insert into t10m
  select (random()*1000000000)::int from generate_series(1,10000000);
explain analyze select distinct i from t10m;

create table t100m(i int);
insert into t100m
  select (random()*2000000000)::int from generate_series(1,100000000);
explain analyze select distinct i from t100m;

-- 100M tuples, 10M groups, 10 tuples/group
create table t100m_10(i int);
insert into t100m_10
  select (random()*10000000)::int from generate_series(1,100000000);
explain analyze select distinct i from t100m_10;

create table text10m(t text collate "C.UTF-8", i int, n numeric);
insert into text10m
  select s.g::text, s.g, s.g::numeric
  from (select (random()*1000000000)::int as g
        from generate_series(1,10000000)) s;
explain analyze select distinct t from text10m;
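For reference, here is a minimal sketch of how one cell of the tables
above can be measured; the particular work_mem setting and table are
just examples, and the final division is done by hand from the EXPLAIN
ANALYZE output:

  set work_mem = '4MB';   -- or '1TB' for the in-memory rows

  -- EXPLAIN ANALYZE reports the estimated total cost of the top plan
  -- node (in cost units) and the actual execution time (in ms):
  explain analyze select distinct i from t10m;

  -- cost-units/second = total cost of the top node
  --                     / (Execution Time in ms / 1000)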
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 104e779f6ac..f39e6a9f80d 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -2416,6 +2416,7 @@ cost_agg(Path *path, PlannerInfo *root,
         double      pages;
         double      pages_written = 0.0;
         double      pages_read = 0.0;
+        double      spill_cost;
         double      hashentrysize;
         double      nbatches;
         Size        mem_limit;
@@ -2453,9 +2454,21 @@ cost_agg(Path *path, PlannerInfo *root,
         pages = relation_byte_size(input_tuples, input_width) / BLCKSZ;
         pages_written = pages_read = pages * depth;
 
+        /*
+         * HashAgg has somewhat worse IO behavior than Sort on typical
+         * hardware/OS combinations. Account for this with a generic penalty.
+         */
+        pages_read *= 2.0;
+        pages_written *= 2.0;
+
         startup_cost += pages_written * random_page_cost;
         total_cost += pages_written * random_page_cost;
         total_cost += pages_read * seq_page_cost;
+
+        /* account for CPU cost of spilling a tuple and reading it back */
+        spill_cost = depth * input_tuples * 2.0 * cpu_tuple_cost;
+        startup_cost += spill_cost;
+        total_cost += spill_cost;
     }
 
     /*
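As a side note (not part of the patch), one way to confirm that the new
penalty only affects plans that actually spill is to look at the
HashAggregate node in EXPLAIN ANALYZE, which in 13 reports batches and
disk usage. A sketch, using the same example tables:

  set work_mem = '4MB';
  explain (analyze, costs off) select distinct i from t100m;
  -- a spilling HashAggregate shows "Batches:" greater than 1 along with
  -- "Disk Usage:"; with work_mem = '1TB' it should show "Batches: 1"
  -- and only "Memory Usage:".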