Re: Disk-based hash aggregate's cost model

2020-09-08 Thread Tomas Vondra
On Mon, Sep 07, 2020 at 01:55:28PM -0700, Jeff Davis wrote: On Sun, 2020-09-06 at 23:21 +0200, Tomas Vondra wrote: I've tested the costing changes on the simplified TPC-H query, on two different machines, and it seems like a clear improvement. Thank you. Committed. So yeah, the patched costi

Re: Disk-based hash aggregate's cost model

2020-09-07 Thread Jeff Davis
On Sun, 2020-09-06 at 23:21 +0200, Tomas Vondra wrote: > I've tested the costing changes on the simplified TPC-H query, on two > different machines, and it seems like a clear improvement. Thank you. Committed. > So yeah, the patched costing is much closer to sort (from the point > of > this cost/

Re: Disk-based hash aggregate's cost model

2020-09-06 Thread Tomas Vondra
Hi, I've tested the costing changes on the simplified TPC-H query, on two different machines, and it seems like a clear improvement. This is using the same cost/duration measure, which I think is pretty neat way to look at this. Sure, it's imperfect (depends on which cost and durations you actua

Re: Disk-based hash aggregate's cost model

2020-09-04 Thread Jeff Davis
On Fri, 2020-09-04 at 21:01 +0200, Tomas Vondra wrote: > Wouldn't it be enough to just use a slot with smaller tuple > descriptor? > All we'd need to do is creating the descriptor in ExecInitAgg after > calling find_hash_columns, and using it for rslot/wslot, and then > "mapping" the attributes in

Re: Disk-based hash aggregate's cost model

2020-09-04 Thread Tomas Vondra
On Fri, Sep 04, 2020 at 11:31:36AM -0700, Jeff Davis wrote: On Fri, 2020-09-04 at 14:56 +0200, Tomas Vondra wrote: Those charts show that the CP_SMALL_TLIST resulted in smaller temp files (per EXPLAIN ANALYZE the difference is ~25%) and also lower query durations (also in the ~25% range). I wa

Re: Disk-based hash aggregate's cost model

2020-09-04 Thread Jeff Davis
On Fri, 2020-09-04 at 14:56 +0200, Tomas Vondra wrote: > Those charts show that the CP_SMALL_TLIST resulted in smaller temp > files > (per EXPLAIN ANALYZE the difference is ~25%) and also lower query > durations (also in the ~25% range). I was able to reproduce the problem, thank you. Only two at

Re: Disk-based hash aggregate's cost model

2020-09-04 Thread Tomas Vondra
On Thu, Sep 03, 2020 at 05:53:43PM -0700, Jeff Davis wrote: On Tue, 2020-09-01 at 23:19 +0200, Tomas Vondra wrote: FWIW any thoughts about the different in temp size compared to CP_SMALL_TLIST? Are you referring to results from a while ago? In this thread I don't see what you're referring to.

Re: Disk-based hash aggregate's cost model

2020-09-03 Thread Jeff Davis
On Tue, 2020-09-01 at 23:19 +0200, Tomas Vondra wrote: > FWIW any thoughts about the different in temp size compared to > CP_SMALL_TLIST? Are you referring to results from a while ago? In this thread I don't see what you're referring to. I tried in a simple case on REL_13_STABLE, with and without

Re: Disk-based hash aggregate's cost model

2020-09-03 Thread Jeff Davis
On Wed, 2020-09-02 at 17:35 -0700, Peter Geoghegan wrote: > On Wed, Sep 2, 2020 at 5:18 PM Jeff Davis wrote: > > create table text10m(t text collate "C.UTF-8", i int, n numeric); > > insert into text10m select s.g::text, s.g, s.g::numeric from > > (select > > (random()*10)::int as g from g

Re: Disk-based hash aggregate's cost model

2020-09-02 Thread Peter Geoghegan
On Wed, Sep 2, 2020 at 5:18 PM Jeff Davis wrote: > create table text10m(t text collate "C.UTF-8", i int, n numeric); > insert into text10m select s.g::text, s.g, s.g::numeric from (select > (random()*10)::int as g from generate_series(1,1000)) s; > explain analyze select distinct t fro

Re: Disk-based hash aggregate's cost model

2020-09-02 Thread Jeff Davis
On Sun, 2020-08-30 at 17:03 +0200, Tomas Vondra wrote: > So I'm wondering if the hashagg is not ignoring similar non-I/O costs > for the spilling case. In particular, the initial section computing > startup_cost seems to ignore that we may need to so some of the stuff > repeatedly - for example we'

Re: Disk-based hash aggregate's cost model

2020-09-01 Thread Peter Geoghegan
On Tue, Sep 1, 2020 at 2:19 AM Tomas Vondra wrote: > FWIW I suspect some of this difference may be due to logical vs. > physical I/O. iosnoop only tracks physical I/O sent to the device, but > maybe we do much more logical I/O and it simply does not expire from > page cache for the sort. It might

Re: Disk-based hash aggregate's cost model

2020-09-01 Thread Tomas Vondra
On Tue, Sep 01, 2020 at 12:58:59PM -0700, Jeff Davis wrote: On Tue, 2020-09-01 at 11:19 +0200, Tomas Vondra wrote: Why? I don't think we need to change costing of in-memory HashAgg. My assumption was we'd only tweak startup_cost for cases with spilling by adding something like (cpu_operator_cost

Re: Disk-based hash aggregate's cost model

2020-09-01 Thread Jeff Davis
On Tue, 2020-09-01 at 11:19 +0200, Tomas Vondra wrote: > Why? I don't think we need to change costing of in-memory HashAgg. My > assumption was we'd only tweak startup_cost for cases with spilling > by > adding something like (cpu_operator_cost * npartitions * ntuples). The code above (the in-memo

Re: Disk-based hash aggregate's cost model

2020-09-01 Thread Tomas Vondra
On Mon, Aug 31, 2020 at 11:34:34PM -0700, Jeff Davis wrote: On Sun, 2020-08-30 at 17:03 +0200, Tomas Vondra wrote: So I'm wondering if the hashagg is not ignoring similar non-I/O costs for the spilling case. In particular, the initial section computing startup_cost seems to ignore that we may ne

Re: Disk-based hash aggregate's cost model

2020-08-31 Thread Jeff Davis
On Sun, 2020-08-30 at 17:03 +0200, Tomas Vondra wrote: > So I'm wondering if the hashagg is not ignoring similar non-I/O costs > for the spilling case. In particular, the initial section computing > startup_cost seems to ignore that we may need to so some of the stuff > repeatedly - for example we'

Re: Disk-based hash aggregate's cost model

2020-08-30 Thread Tomas Vondra
On Sun, Aug 30, 2020 at 02:26:20AM +0200, Tomas Vondra wrote: On Fri, Aug 28, 2020 at 06:32:38PM -0700, Jeff Davis wrote: On Thu, 2020-08-27 at 17:28 -0700, Peter Geoghegan wrote: We have a Postgres 13 open item for Disk-based hash aggregate, which is the only non-trivial open item. There is a

Re: Disk-based hash aggregate's cost model

2020-08-29 Thread Tomas Vondra
On Fri, Aug 28, 2020 at 06:32:38PM -0700, Jeff Davis wrote: On Thu, 2020-08-27 at 17:28 -0700, Peter Geoghegan wrote: We have a Postgres 13 open item for Disk-based hash aggregate, which is the only non-trivial open item. There is a general concern about how often we get disk-based hash aggregat

Re: Disk-based hash aggregate's cost model

2020-08-28 Thread Jeff Davis
On Thu, 2020-08-27 at 17:28 -0700, Peter Geoghegan wrote: > We have a Postgres 13 open item for Disk-based hash aggregate, which > is the only non-trivial open item. There is a general concern about > how often we get disk-based hash aggregation when work_mem is > particularly low, and recursion se