On Sun, 2020-08-30 at 17:03 +0200, Tomas Vondra wrote:
> So I'm wondering if the hashagg is not ignoring similar non-I/O costs
> for the spilling case. In particular, the initial section computing
> startup_cost seems to ignore that we may need to do some of the stuff
> repeatedly - for example we'll repeat hash lookups for spilled tuples,
> and so on.
I tried to analyze this using a slightly different approach: cost units
per second of runtime. Obviously this will vary based on the machine,
but it's interesting anyway. All of the following fit in system memory.
Schema, data, and queries are at the end of this email, along with a
sketch of how the cost-units/second numbers were taken.

A low value of cost-units/second-of-runtime means "more likely to be
chosen incorrectly" (the plan's cost underestimates its actual
runtime), and a high value means "more likely to be missed
incorrectly".

 Plan    | work_mem | 10M  | 100M INT4    | 100M | 10M
         |          | INT4 | (10M groups) | INT4 | TEXT
---------+----------+------+--------------+------+-----
 HashAgg | 4MB      |   88 |           63 |   82 |  78
 HashAgg | 1TB      |   41 |           37 |   33 |  38
 Sort    | 4MB      |  182 |          188 |  174 |  37
 Sort    | 1TB      |  184 |          231 |  189 |  30

Sort is consistent for a given data type, but it seems that the
comparison cost is far from proportionate between data types. HashAgg
is consistent between the data types, but the disk costs play a larger
role (in this case there is no actual disk access to worry about,
because everything fits in system memory).

At least for these simple queries, Sort is punished unfairly for INT4,
but gets an unfair boost for TEXT.

It seems that we should make a change here, but to be conservative for
13 we should limit changes to the new plans, which are the first line
(HashAgg that spills).

The attached patch makes two adjustments: a 2X disk penalty for
HashAgg, and a new CPU cost for spilling:

  spill_cost = depth * input_tuples * 2.0 * cpu_tuple_cost

The new results:

 Plan    | work_mem | 10M  | 100M INT4    | 100M | 10M
         |          | INT4 | (10M groups) | INT4 | TEXT
---------+----------+------+--------------+------+-----
 HashAgg | 4MB      |  192 |          131 |  178 | 166

That's much more comparable to Sort for INT4, but it makes the gap
wider for TEXT. Fortunately, at least for my simple queries, it just
barely avoids a plan change to Sort for the TEXT case (where the Sort
plan is nearly 5X slower than HashAgg).

I think this approach is reasonable for 13: it only changes the costing
for HashAgg plans that are expected to spill, it's fairly conservative
so we will not get lots of new bad plans, and it still seems to use
HashAgg for cases where it clearly should.

Note: in-memory HashAgg is still unfairly favored over Sort, at least
for these cases.

Regards,
	Jeff Davis


create table t10m(i int);
insert into t10m
  select (random()*1000000000)::int from generate_series(1,10000000);
explain analyze select distinct i from t10m;

create table t100m(i int);
insert into t100m
  select (random()*2000000000)::int from generate_series(1,100000000);
explain analyze select distinct i from t100m;

-- 100M tuples, 10M groups, 10 tuples/group
create table t100m_10(i int);
insert into t100m_10
  select (random()*10000000)::int from generate_series(1,100000000);
explain analyze select distinct i from t100m_10;

create table text10m(t text collate "C.UTF-8", i int, n numeric);
insert into text10m
  select s.g::text, s.g, s.g::numeric
  from (select (random()*1000000000)::int as g
        from generate_series(1,10000000)) s;
explain analyze select distinct t from text10m;
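For reference, here is a minimal sketch of how one cell of the tables
above can be measured; the particular work_mem setting and table are
just examples, and the final division is done by hand from the EXPLAIN
ANALYZE output:

  set work_mem = '4MB';   -- or '1TB' for the in-memory rows

  -- EXPLAIN ANALYZE reports the estimated total cost of the top plan
  -- node (in cost units) and the actual execution time (in ms):
  explain analyze select distinct i from t10m;

  -- cost-units/second = total cost of the top node
  --                     / (Execution Time in ms / 1000)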
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 104e779f6ac..f39e6a9f80d 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -2416,6 +2416,7 @@ cost_agg(Path *path, PlannerInfo *root,
         double      pages;
         double      pages_written = 0.0;
         double      pages_read = 0.0;
+        double      spill_cost;
         double      hashentrysize;
         double      nbatches;
         Size        mem_limit;
@@ -2453,9 +2454,21 @@ cost_agg(Path *path, PlannerInfo *root,
         pages = relation_byte_size(input_tuples, input_width) / BLCKSZ;
         pages_written = pages_read = pages * depth;
 
+        /*
+         * HashAgg has somewhat worse IO behavior than Sort on typical
+         * hardware/OS combinations. Account for this with a generic penalty.
+         */
+        pages_read *= 2.0;
+        pages_written *= 2.0;
+
         startup_cost += pages_written * random_page_cost;
         total_cost += pages_written * random_page_cost;
         total_cost += pages_read * seq_page_cost;
+
+        /* account for CPU cost of spilling a tuple and reading it back */
+        spill_cost = depth * input_tuples * 2.0 * cpu_tuple_cost;
+        startup_cost += spill_cost;
+        total_cost += spill_cost;
     }
 
     /*
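As a side note (not part of the patch), one way to confirm that the new
penalty only affects plans that actually spill is to look at the
HashAggregate node in EXPLAIN ANALYZE, which in 13 reports batches and
disk usage. A sketch, using the same example tables:

  set work_mem = '4MB';
  explain (analyze, costs off) select distinct i from t100m;
  -- a spilling HashAggregate shows "Batches:" greater than 1 along with
  -- "Disk Usage:"; with work_mem = '1TB' it should show "Batches: 1"
  -- and only "Memory Usage:".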