On Thu, 2020-08-27 at 17:28 -0700, Peter Geoghegan wrote:
> We have a Postgres 13 open item for Disk-based hash aggregate, which
> is the only non-trivial open item. There is a general concern about
> how often we get disk-based hash aggregation when work_mem is
> particularly low, and recursion seems unavoidable. This is generally
> thought to be a problem in the costing.
We discussed two approaches to tweaking the cost model:

1. Penalize HashAgg disk costs by a constant amount. HashAgg seems to be
chosen a little too often, and a constant penalty would reduce the number
of plan changes.

2. Treat recursive HashAgg spilling skeptically, and heavily penalize
recursive spilling.

The problem with approach #2 is that the default hash memory limit is only
4MB, and real systems have a lot more than that. Even in that low-memory
scenario, recursive spilling can beat Sort by a lot. For instance:

Data:

create table text10m(t text collate "C.UTF-8", i int, n numeric);
insert into text10m
  select s.g::text, s.g, s.g::numeric
  from (select (random()*1000000000)::int as g
        from generate_series(1,10000000)) s;
vacuum (freeze, analyze) text10m;

Query:

explain analyze select distinct t from text10m;

HashAgg: 10.5s
Sort+Distinct: 46s

(A session sketch for reproducing both plans is at the end of this
message.)

I'm inclined toward option #1 for simplicity unless you feel strongly
about option #2. Specifically, I was thinking of a 1.5X penalty for
HashAgg disk costs.

Regards,
	Jeff Davis
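
Something along these lines should reproduce the comparison on the same
data. This is only a sketch: work_mem and enable_hashagg are the stock
planner GUCs, and whether the planner actually picks the spilling HashAgg
plan for the first query depends on the costing under discussion.

-- assumes the text10m table above has been created and analyzed
set work_mem = '4MB';   -- low enough that the hash table must spill

-- planner's choice; expected to be a disk-based HashAgg at this work_mem
explain analyze select distinct t from text10m;

-- disable hash aggregation to get the Sort-based plan for comparison
set enable_hashagg = off;
explain analyze select distinct t from text10m;

reset enable_hashagg;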