Re: Memory-Bounded Hash Aggregation

2020-03-28 Thread Jeff Davis
On Fri, 2020-03-27 at 02:31 +0100, Tomas Vondra wrote: > On Thu, Mar 26, 2020 at 05:56:56PM +0800, Richard Guo wrote: > > If nbatches is some number between 1.0 and 2.0, we would have a > > negative > > depth. As a result, we may have a negative cost for hash > > aggregation > > plan node, as descr

Re: Memory-Bounded Hash Aggregation

2020-03-26 Thread Tomas Vondra
On Thu, Mar 26, 2020 at 05:56:56PM +0800, Richard Guo wrote: Hello, When calculating the disk costs of hash aggregation that spills to disk, there is something wrong with how we determine depth: depth = ceil( log(nbatches - 1) / log(num_partitions) ); If nbatches is some number be

Re: Memory-Bounded Hash Aggregation

2020-03-26 Thread Richard Guo
Hello, When calculating the disk costs of hash aggregation that spills to disk, there is something wrong with how we determine depth: >depth = ceil( log(nbatches - 1) / log(num_partitions) ); If nbatches is some number between 1.0 and 2.0, we would have a negative depth. As a result,

Re: Memory-Bounded Hash Aggregation

2020-03-19 Thread Pengzhou Tang
On Fri, Mar 20, 2020 at 1:20 PM Pengzhou Tang wrote: > Hi, > > I happen to notice that "set enable_sort to false" cannot guarantee the > planner to use hashagg in test groupingsets.sql, > the following comparing results of sortagg and hashagg seems to have no > meaning. > > Please forget my comme

Re: Memory-Bounded Hash Aggregation

2020-03-19 Thread Pengzhou Tang
Hi, I happen to notice that "set enable_sort to false" cannot guarantee the planner to use hashagg in test groupingsets.sql, the following comparing results of sortagg and hashagg seems to have no meaning. Thanks, Pengzhou On Thu, Mar 19, 2020 at 7:36 AM Jeff Davis wrote: > > Committed. > > Th

Re: Memory-Bounded Hash Aggregation

2020-03-18 Thread Justin Pryzby
On Sun, Mar 15, 2020 at 04:05:37PM -0700, Jeff Davis wrote: > > + if (from_tape) > > + partition_mem += HASHAGG_READ_BUFFER_SIZE; > > + partition_mem = npartitions * HASHAGG_WRITE_BUFFER_SIZE; > > > > => That looks wrong ; should say += ? > > Good catch! Fixed. > +++ b/src/backend/

Re: Memory-Bounded Hash Aggregation

2020-03-18 Thread Tomas Vondra
On Wed, Mar 18, 2020 at 04:35:57PM -0700, Jeff Davis wrote: Committed. \o/ -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Memory-Bounded Hash Aggregation

2020-03-18 Thread Jeff Davis
Committed. There's some future work that would be nice (some of these are just ideas and may not be worth it): * Refactor MemoryContextMemAllocated() to be a part of MemoryContextStats(), but allow it to avoid walking through the blocks and freelists. * Improve the choice of the initial number

Re: Memory-Bounded Hash Aggregation

2020-03-15 Thread Jeff Davis
On Thu, 2020-03-12 at 16:01 -0500, Justin Pryzby wrote: > I don't understand what's meant by "the chosen plan". > Should it say, "at execution ..." instead of "execution time" ? I removed that wording; hopefully it's more clear without it? > Either remove "plan types" for consistency with > enabl

Re: Memory-Bounded Hash Aggregation

2020-03-12 Thread Justin Pryzby
On Wed, Mar 11, 2020 at 11:55:35PM -0700, Jeff Davis wrote: > * tweaked EXPLAIN output some more > Unless I (or someone else) finds something significant, this is close > to commit. Thanks for working on this ; I finally made a pass over the patch. +++ b/doc/src/sgml/config.sgml + enable_gr

Re: Memory-Bounded Hash Aggregation

2020-03-11 Thread Jeff Davis
On Wed, 2020-02-26 at 19:14 -0800, Jeff Davis wrote: > Rebased on your change. This simplified the JIT and interpretation > code > quite a bit. Attached another version. * tweaked EXPLAIN output some more * rebased and cleaned up * Added back the enable_hashagg_disk flag (defaulting to on). I'

Re: Memory-Bounded Hash Aggregation

2020-02-26 Thread Jeff Davis
On Mon, 2020-02-24 at 15:29 -0800, Andres Freund wrote: > On 2020-02-22 11:02:16 -0800, Jeff Davis wrote: > > On Sat, 2020-02-22 at 10:00 -0800, Andres Freund wrote: > > > Both patches, or just 0013? Seems the earlier one might make the > > > addition of the opcodes you add less verbose? > > > > J

Re: Memory-Bounded Hash Aggregation

2020-02-24 Thread Andres Freund
On 2020-02-22 11:02:16 -0800, Jeff Davis wrote: > On Sat, 2020-02-22 at 10:00 -0800, Andres Freund wrote: > > Both patches, or just 0013? Seems the earlier one might make the > > addition of the opcodes you add less verbose? > > Just 0013, thank you. 0008 looks like it will simplify things. Pushe

Re: Memory-Bounded Hash Aggregation

2020-02-22 Thread Tomas Vondra
On Thu, Feb 20, 2020 at 04:56:38PM -0800, Jeff Davis wrote: On Wed, 2020-02-19 at 20:16 +0100, Tomas Vondra wrote: 1) explain.c currently does this: I wonder if we could show something for plain explain (without analyze). At least the initial estimate of partitions, etc. I know not showing thos

Re: Memory-Bounded Hash Aggregation

2020-02-22 Thread Jeff Davis
On Sat, 2020-02-22 at 10:00 -0800, Andres Freund wrote: > Both patches, or just 0013? Seems the earlier one might make the > addition of the opcodes you add less verbose? Just 0013, thank you. 0008 looks like it will simplify things. Regards, Jeff Davis

Re: Memory-Bounded Hash Aggregation

2020-02-22 Thread Andres Freund
Hi, On 2020-02-22 09:55:26 -0800, Jeff Davis wrote: > On Fri, 2020-02-21 at 12:22 -0800, Andres Freund wrote: > > I'd also like to apply something like 0013 from that thread, I find > > the > > whole curperagg, select_current_set, curaggcontext logic confusing as > > hell. I'd so far planned to pu

Re: Memory-Bounded Hash Aggregation

2020-02-22 Thread Jeff Davis
On Fri, 2020-02-21 at 12:22 -0800, Andres Freund wrote: > I'd also like to apply something like 0013 from that thread, I find > the > whole curperagg, select_current_set, curaggcontext logic confusing as > hell. I'd so far planned to put this on the backburner until this > patch > has been committe

Re: Memory-Bounded Hash Aggregation

2020-02-21 Thread Andres Freund
Hi, On 2020-02-19 20:16:36 +0100, Tomas Vondra wrote: > 3) I wonder if we need to invent new opcodes? Wouldn't it be simpler to > just add a new flag to the agg_* structs instead? I haven't tried hacking > this, so maybe it's a silly idea. New opcodes don't really cost that much - it's a jump tab

Re: Memory-Bounded Hash Aggregation

2020-02-20 Thread Jeff Davis
On Wed, 2020-02-19 at 20:16 +0100, Tomas Vondra wrote: > 1) explain.c currently does this: > > I wonder if we could show something for plain explain (without > analyze). > At least the initial estimate of partitions, etc. I know not showing > those details until after execution is what e.g. sort d

Re: Memory-Bounded Hash Aggregation

2020-02-19 Thread Adam Lee
On Wed, Feb 19, 2020 at 08:16:36PM +0100, Tomas Vondra wrote: > 5) Assert(nbuckets > 0); > ... > This however quickly fails on this assert in BuildTupleHashTableExt (see > backtrace1.txt): > > Assert(nbuckets > 0); > > The value is computed in hash_choose_num_buckets, and there seem to be > no

Re: Memory-Bounded Hash Aggregation

2020-02-19 Thread Adam Lee
On Wed, Feb 19, 2020 at 08:16:36PM +0100, Tomas Vondra wrote: > 4) lookup_hash_entries says > > /* check to see if we need to spill the tuple for this grouping set */ > > But that seems bogus, because AFAIK we can't spill tuples for grouping > sets. So maybe this should say just "grouping"? As

Re: Memory-Bounded Hash Aggregation

2020-02-19 Thread Tomas Vondra
Hi, I've started reviewing the 20200218 version of the patch. In general it seems fine, but I have a couple minor comments and two crashes. 1) explain.c currently does this: I wonder if we could show something for plain explain (without analyze). At least the initial estimate of partitions, et

Re: Memory-Bounded Hash Aggregation

2020-02-18 Thread Melanie Plageman
On Tue, Feb 18, 2020 at 10:57 AM Tomas Vondra wrote: > Hi, > > I wanted to take a look at this thread and do a review, but it's not > very clear to me if the recent patches posted here are independent or > how exactly they fit together. I see > > 1) hashagg-20200212-1.patch (2020/02/13 by Jeff) >

Re: Memory-Bounded Hash Aggregation

2020-02-18 Thread Tomas Vondra
Hi, I wanted to take a look at this thread and do a review, but it's not very clear to me if the recent patches posted here are independent or how exactly they fit together. I see 1) hashagg-20200212-1.patch (2020/02/13 by Jeff) 2) refactor.patch (2020/02/13 by Jeff) 3) v1-0001-aggregated-unag

Re: Memory-Bounded Hash Aggregation

2020-02-14 Thread Jeff Davis
On Wed, 2020-02-12 at 21:51 -0800, Jeff Davis wrote: > On Mon, 2020-02-10 at 15:57 -0800, Jeff Davis wrote: > > Attaching latest version (combined logtape changes along with main > > HashAgg patch). > > I ran a matrix of small performance tests to look for regressions. I ran some more tests, this

Re: Memory-Bounded Hash Aggregation

2020-02-13 Thread Melanie Plageman
On Wed, Jan 8, 2020 at 2:38 AM Heikki Linnakangas wrote: > This makes the assumption that all Aggrefs or GroupingFuncs are at the > top of the TargetEntry. That's not true, e.g.: > > select 0+sum(a) from foo group by b; > > I think find_aggregated_cols() and find_unaggregated_cols() should be > m

Re: Memory-Bounded Hash Aggregation

2020-02-13 Thread Jeff Davis
On Wed, 2020-02-12 at 21:51 -0800, Jeff Davis wrote: > The new patch is basically just rebased -- a few other very minor > changes. I extracted out some minor refactoring of nodeAgg.c that I can commit separately. That will make the main patch a little easier to review. Attached. * split build_ha

Re: Memory-Bounded Hash Aggregation

2020-02-06 Thread Peter Geoghegan
On Thu, Feb 6, 2020 at 12:01 AM Heikki Linnakangas wrote: > It wasn't strictly required for what I was hacking on then. IIRC it > would have saved some memory during sorting, but Peter G felt that it > wasn't worth the trouble, because he made some other changes around the > same time, which made

Re: Memory-Bounded Hash Aggregation

2020-02-06 Thread Heikki Linnakangas
On 05/02/2020 21:56, Jeff Davis wrote: On Tue, 2020-02-04 at 18:10 +0200, Heikki Linnakangas wrote: I'd love to change the LogicalTape API so that you could allocate and free tapes more freely. I wrote a patch to do that, as part of replacing tuplesort.c's polyphase algorithm with a simpler one

Re: Memory-Bounded Hash Aggregation

2020-02-05 Thread Jeff Davis
On Fri, 2020-01-24 at 17:01 -0800, Jeff Davis wrote: > New patch attached. Three minor independent refactoring patches: 1. Add new entry points for the tuple hash table: TupleHashTableHash() LookupTupleHashEntryHash() which are useful for saving and reusing hash values to avoid recomputing.

Re: Memory-Bounded Hash Aggregation

2020-02-05 Thread Jeff Davis
On Wed, 2020-02-05 at 11:56 -0800, Jeff Davis wrote: > Regarding the API, I'd like to change it, but I'm running into some > performance challenges when adding a layer of indirection. If I apply > the very simple attached patch, which simply makes a separate > allocation for the tapes array, it see

Re: Memory-Bounded Hash Aggregation

2020-02-05 Thread Jeff Davis
On Tue, 2020-02-04 at 18:10 +0200, Heikki Linnakangas wrote: > I'd love to change the LogicalTape API so that you could allocate > and > free tapes more freely. I wrote a patch to do that, as part of > replacing > tuplesort.c's polyphase algorithm with a simpler one (see [1]), but > I > never go

Re: Memory-Bounded Hash Aggregation

2020-02-05 Thread Peter Geoghegan
On Wed, Feb 5, 2020 at 10:37 AM Jeff Davis wrote: > > LogicalTapeSetExtend() seems to work in a way that assumes that the > > tape is frozen. It would be good to document that assumption, and > > possible enforce it by way of an assertion. The same remark applies > > to > > any other assumptions y

Re: Memory-Bounded Hash Aggregation

2020-02-05 Thread Jeff Davis
On Tue, 2020-02-04 at 15:08 -0800, Peter Geoghegan wrote: > Have you tested this against tuplesort.c, particularly parallel > CREATE > INDEX? It would be worth trying to measure any performance impact. > Note that most parallel CREATE INDEX tuplesorts will do a merge > within > each worker, and one

Re: Memory-Bounded Hash Aggregation

2020-02-05 Thread Jeff Davis
On Tue, 2020-02-04 at 18:42 +0800, Adam Lee wrote: > The minheap looks good, have you tested the performance and aggregate > validation? Not sure exactly what you mean, but I tested the min heap with both Sort and HashAgg and it performs well. > About the "Cap the size of the LogicalTapeSet's fre

Re: Memory-Bounded Hash Aggregation

2020-02-04 Thread Thomas Munro
On Wed, Feb 5, 2020 at 12:08 PM Peter Geoghegan wrote: > Parallel CREATE INDEX is currently accidentally disabled on the master > branch. That should be fixed in the next couple of days. You can > temporarily revert 74618e77 if you want to get it back for testing > purposes today. (Fixed -- sorry

Re: Memory-Bounded Hash Aggregation

2020-02-04 Thread Peter Geoghegan
On Mon, Feb 3, 2020 at 6:24 PM Jeff Davis wrote: > And now I'm attaching another version of the main Hash Aggregation > patch to be applied on top of the logtape.c patch. Have you tested this against tuplesort.c, particularly parallel CREATE INDEX? It would be worth trying to measure any performa

Re: Memory-Bounded Hash Aggregation

2020-02-04 Thread Heikki Linnakangas
On 03/02/2020 20:29, Jeff Davis wrote: 1. Use a minheap for the freelist. The original design used an array that had to be sorted between a read (which frees a block) and a write (which needs to sort the array to consume the lowest block number). The comments said: * sorted. This is an effic

Re: Memory-Bounded Hash Aggregation

2020-02-04 Thread Adam Lee
On Mon, Feb 03, 2020 at 06:24:14PM -0800, Jeff Davis wrote: > On Mon, 2020-02-03 at 10:29 -0800, Jeff Davis wrote: > > I ended up converting the freelist to a min heap. > > > > Attached is a patch which makes three changes to better support > > HashAgg: > > And now I'm attaching another version o

Re: Memory-Bounded Hash Aggregation

2020-02-03 Thread Jeff Davis
On Wed, 2020-01-29 at 14:48 -0800, Jeff Davis wrote: > 2) Use a different structure more capable of handling a large > fraction > of free space. A compressed bitmap might make sense, but that seems > like overkill to waste effort tracking a lot of space that is > unlikely > to ever be used. I ende

Re: Memory-Bounded Hash Aggregation

2020-01-29 Thread Jeff Davis
On Fri, 2020-01-24 at 17:16 -0800, Peter Geoghegan wrote: > That sounds weird. Might be pathological in some sense. > > I have a wild guess for you. Maybe this has something to do with the > "test for presorted input" added by commit a3f0b3d68f9. That can > perform very badly when the input is alm

Re: Memory-Bounded Hash Aggregation

2020-01-24 Thread Peter Geoghegan
On Fri, Jan 24, 2020 at 5:01 PM Jeff Davis wrote: > Unfortunately, I'm seeing some bad behavior (at least in some cases) > with logtape.c, where it's spending a lot of time qsorting the list of > free blocks. Adam, did you also see this during your perf tests? It > seems to be worst with lower wor

Re: Memory-Bounded Hash Aggregation

2020-01-08 Thread Heikki Linnakangas
On 28/12/2019 01:35, Jeff Davis wrote: I've attached a new patch that adds some basic costing for disk during hashagg. This patch (hashagg-20191227.patch) doesn't compile: nodeAgg.c:3379:7: error: ‘hashagg_mem_overflow’ undeclared (first use in this function) if (hashagg_mem_overflow)

Re: Memory-Bounded Hash Aggregation

2020-01-07 Thread Adam Lee
Hi, Jeff I tried to use the logical tape APIs for hash agg spilling, based on your 1220 version. Turns out it doesn't make much of performance difference with the default 8K block size (might be my patch's problem), but the disk space (not I/O) would be saved a lot because I force the respilling

Re: Memory-Bounded Hash Aggregation

2019-12-27 Thread Jeff Davis
On Sat, 2019-12-14 at 18:32 +0100, Tomas Vondra wrote: > So I think we're not costing the batching properly / at all. Hi, I've attached a new patch that adds some basic costing for disk during hashagg. The accuracy is unfortunately not great, especially at smaller work_mem sizes and smaller entr

Re: Memory-Bounded Hash Aggregation

2019-12-22 Thread Jeff Davis
On Sat, 2019-12-14 at 18:32 +0100, Tomas Vondra wrote: > So I think we're not costing the batching properly / at all. Thank you for all of the testing! I think the results are good: even for cases where HashAgg is the wrong choice, it's not too bad. You're right that costing is not done, and when

Re: Memory-Bounded Hash Aggregation

2019-12-22 Thread Jeff Davis
On Tue, 2019-12-10 at 13:34 -0800, Adam Lee wrote: > Melanie and I tried this, had a installcheck passed patch. The way > how > we verify it is composing a wide table with long unnecessary text > columns, then check the size it writes on every iteration. > > Please check out the attachment, it's b

Re: Memory-Bounded Hash Aggregation

2019-12-20 Thread Adam Lee
On Sat, Dec 14, 2019 at 06:32:25PM +0100, Tomas Vondra wrote: > I've done a bit more testing on this, after resolving a couple of minor > conflicts due to recent commits (rebased version attached). > > In particular, I've made a comparison with different dataset sizes, > group sizes, GUC settings

Re: Memory-Bounded Hash Aggregation

2019-12-14 Thread Tomas Vondra
On Wed, Nov 27, 2019 at 02:58:04PM -0800, Jeff Davis wrote: On Wed, 2019-08-28 at 12:52 -0700, Taylor Vesely wrote: Right now the patch always initializes 32 spill partitions. Have you given any thought into how to intelligently pick an optimal number of partitions yet? Attached a new patch th

Re: Memory-Bounded Hash Aggregation

2019-12-13 Thread Tomas Vondra
On Thu, Dec 12, 2019 at 06:10:50PM -0800, Jeff Davis wrote: On Thu, 2019-11-28 at 18:46 +0100, Tomas Vondra wrote: 13) As for this: /* make sure that we don't exhaust the hash bits */ if (partition_bits + input_bits >= 32) partition_bits = 32 - input_bits; We already ran int

Re: Memory-Bounded Hash Aggregation

2019-12-12 Thread Jeff Davis
On Thu, 2019-11-28 at 18:46 +0100, Tomas Vondra wrote: > 13) As for this: > > /* make sure that we don't exhaust the hash bits */ > if (partition_bits + input_bits >= 32) > partition_bits = 32 - input_bits; > > We already ran into this issue (exhausting bits in a hash value) in

Re: Memory-Bounded Hash Aggregation

2019-12-10 Thread Adam Lee
On Wed, Dec 04, 2019 at 10:57:51PM -0800, Jeff Davis wrote: > > About the `TODO: project needed attributes only` in your patch, when > > would the input tuple contain columns not needed? It seems like > > anything > > you can project has to be in the group or aggregates. > > If you have a table li

Re: Memory-Bounded Hash Aggregation

2019-12-05 Thread Andres Freund
On 2019-12-05 12:55:51 -0800, Jeff Davis wrote: > On Thu, 2019-11-28 at 18:46 +0100, Tomas Vondra wrote: > > And it's not clear to me why we should remove part of the comment > > before > > TupleHashTableHash. > > It looks like 5dfc1981 changed the signature of TupleHashTableHash > without updatin

Re: Memory-Bounded Hash Aggregation

2019-12-05 Thread Tomas Vondra
On Thu, Dec 05, 2019 at 12:55:51PM -0800, Jeff Davis wrote: On Thu, 2019-11-28 at 18:46 +0100, Tomas Vondra wrote: And it's not clear to me why we should remove part of the comment before TupleHashTableHash. It looks like 5dfc1981 changed the signature of TupleHashTableHash without updating th

Re: Memory-Bounded Hash Aggregation

2019-12-05 Thread Jeff Davis
On Thu, 2019-11-28 at 18:46 +0100, Tomas Vondra wrote: > And it's not clear to me why we should remove part of the comment > before > TupleHashTableHash. It looks like 5dfc1981 changed the signature of TupleHashTableHash without updating the comment, so it doesn't really make sense any more. I jus

Re: Memory-Bounded Hash Aggregation

2019-12-04 Thread Jeff Davis
On Wed, 2019-12-04 at 17:24 -0800, Melanie Plageman wrote: > > It looks like the parameter input_tuples passed to hash_spill_init() > in lookup_hash_entries() is the number of groups estimated by > planner. > However, when reloading a spill file, if we run out of memory and > re-spill, hash_spill

Re: Memory-Bounded Hash Aggregation

2019-12-04 Thread Jeff Davis
On Wed, 2019-12-04 at 19:50 -0800, Adam Lee wrote: > On Wed, Dec 04, 2019 at 06:55:43PM -0800, Jeff Davis wrote: > > > > Thanks very much for a great review! I've attached a new patch. > > Hi, > > About the `TODO: project needed attributes only` in your patch, when > would the input tuple contai

Re: Memory-Bounded Hash Aggregation

2019-12-04 Thread Adam Lee
On Wed, Dec 04, 2019 at 06:55:43PM -0800, Jeff Davis wrote: > > Thanks very much for a great review! I've attached a new patch. Hi, About the `TODO: project needed attributes only` in your patch, when would the input tuple contain columns not needed? It seems like anything you can project has to

Re: Memory-Bounded Hash Aggregation

2019-12-04 Thread Jeff Davis
Thanks very much for a great review! I've attached a new patch. There are some significant changes in the new version also: In the non-spilling path, removed the extra nullcheck branch in the compiled evaltrans expression. When the first tuple is spilled, I the branch becomes necessary, so I rec

Re: Memory-Bounded Hash Aggregation

2019-12-04 Thread Melanie Plageman
On Thu, Nov 28, 2019 at 9:47 AM Tomas Vondra wrote: > On Wed, Nov 27, 2019 at 02:58:04PM -0800, Jeff Davis wrote: > >On Wed, 2019-08-28 at 12:52 -0700, Taylor Vesely wrote: > >> Right now the patch always initializes 32 spill partitions. Have you > >> given > >> any thought into how to intelligen

Re: Memory-Bounded Hash Aggregation

2019-11-28 Thread Tomas Vondra
On Wed, Nov 27, 2019 at 02:58:04PM -0800, Jeff Davis wrote: On Wed, 2019-08-28 at 12:52 -0700, Taylor Vesely wrote: Right now the patch always initializes 32 spill partitions. Have you given any thought into how to intelligently pick an optimal number of partitions yet? Attached a new patch th

Re: Memory-Bounded Hash Aggregation

2019-11-27 Thread Jeff Davis
On Wed, 2019-08-28 at 12:52 -0700, Taylor Vesely wrote: > Right now the patch always initializes 32 spill partitions. Have you > given > any thought into how to intelligently pick an optimal number of > partitions yet? Attached a new patch that addresses this. 1. Divide hash table memory used by

Re: Memory-Bounded Hash Aggregation

2019-08-29 Thread Jeff Davis
On Wed, 2019-08-28 at 12:52 -0700, Taylor Vesely wrote: > I started to review this patch yesterday with Melanie Plageman, so we > rebased this patch over the current master. The main conflicts were > due to a simplehash patch that has been committed separately[1]. I've > attached the rebased patch.

Re: Memory-Bounded Hash Aggregation

2019-08-28 Thread Taylor Vesely
I started to review this patch yesterday with Melanie Plageman, so we rebased this patch over the current master. The main conflicts were due to a simplehash patch that has been committed separately[1]. I've attached the rebased patch. I was playing with the code, and if one of the table's most co

Re: Memory-Bounded Hash Aggregation

2019-08-02 Thread Tomas Vondra
On Fri, Aug 02, 2019 at 08:11:19AM -0700, Jeff Davis wrote: On Fri, 2019-08-02 at 14:44 +0800, Adam Lee wrote: I'm late to the party. You are welcome to join any time! These two approaches both spill the input tuples, what if the skewed groups are not encountered before the hash table fills

Re: Memory-Bounded Hash Aggregation

2019-08-02 Thread Jeff Davis
On Fri, 2019-08-02 at 14:44 +0800, Adam Lee wrote: > I'm late to the party. You are welcome to join any time! > These two approaches both spill the input tuples, what if the skewed > groups are not encountered before the hash table fills up? The spill > files' size and disk I/O could be downsides

Re: Memory-Bounded Hash Aggregation

2019-08-01 Thread Adam Lee
> High-level approaches: > > 1. When the in-memory hash table fills, keep existing entries in the > hash table, and spill the raw tuples for all new groups in a > partitioned fashion. When all input tuples are read, finalize groups > in memory and emit. Now that the in-memory hash table is cleared

Re: Memory-Bounded Hash Aggregation

2019-07-12 Thread Tomas Vondra
On Thu, Jul 11, 2019 at 06:06:33PM -0700, Jeff Davis wrote: On Thu, 2019-07-11 at 17:55 +0200, Tomas Vondra wrote: Makes sense. I haven't thought about how the hybrid approach would be implemented very much, so I can't quite judge how complicated would it be to extend "approach 1" later. But if

Re: Memory-Bounded Hash Aggregation

2019-07-11 Thread Jeff Davis
On Thu, 2019-07-11 at 17:55 +0200, Tomas Vondra wrote: > Makes sense. I haven't thought about how the hybrid approach would be > implemented very much, so I can't quite judge how complicated would > it be > to extend "approach 1" later. But if you think it's a sensible first > step, > I trust you.

Re: Memory-Bounded Hash Aggregation

2019-07-11 Thread Tomas Vondra
On Wed, Jul 03, 2019 at 07:03:06PM -0700, Jeff Davis wrote: On Wed, 2019-07-03 at 02:17 +0200, Tomas Vondra wrote: What does "partitioned hash strategy" do? It's probably explained in one of the historical discussions, but I'm not sure which one. I assume it simply hashes the group keys and uses

Re: Memory-Bounded Hash Aggregation

2019-07-04 Thread Tomas Vondra
On Wed, Jul 03, 2019 at 07:03:06PM -0700, Jeff Davis wrote: On Wed, 2019-07-03 at 02:17 +0200, Tomas Vondra wrote: What does "partitioned hash strategy" do? It's probably explained in one of the historical discussions, but I'm not sure which one. I assume it simply hashes the group keys and uses

Re: Memory-Bounded Hash Aggregation

2019-07-03 Thread Jeff Davis
On Wed, 2019-07-03 at 02:17 +0200, Tomas Vondra wrote: > What does "partitioned hash strategy" do? It's probably explained in > one > of the historical discussions, but I'm not sure which one. I assume > it > simply hashes the group keys and uses that to partition the data, and > then > passing it

Re: Memory-Bounded Hash Aggregation

2019-07-03 Thread Jeff Davis
On Mon, 2019-07-01 at 12:13 -0700, Jeff Davis wrote: > This is for design review. I have a patch (WIP) for Approach 1, and > if > this discussion starts to converge on that approach I will polish and > post it. WIP patch attached (based on 9a81c9fa); targeting September CF. Not intended for detai

Re: Memory-Bounded Hash Aggregation

2019-07-02 Thread Tomas Vondra
Hi Jeff, On Mon, Jul 01, 2019 at 12:13:53PM -0700, Jeff Davis wrote: This is for design review. I have a patch (WIP) for Approach 1, and if this discussion starts to converge on that approach I will polish and post it. Thanks for working on this. Let's start at the beginning: why do we have