On Fri, 2019-08-02 at 14:44 +0800, Adam Lee wrote:
> I'm late to the party.
You are welcome to join any time!

> These two approaches both spill the input tuples, what if the skewed
> groups are not encountered before the hash table fills up? The spill
> files' size and disk I/O could be downsides.

Let's say the worst case is that we encounter 10 million groups of size
one first; just enough to fill up memory. Then, we encounter a single
additional group of size 20 million, and need to write out all of those
20 million raw tuples. That's still not worse than Sort+GroupAgg, which
would need to write out all 30 million raw tuples (in practice Sort is
pretty fast, so it may still win in some cases, but not by any huge
amount).

> Greenplum spills all the groups by writing the partial aggregate
> states, reset the memory context, process incoming tuples and build
> in-memory hash table, then reload and combine the spilled partial
> states at last, how does this sound?

That can be done as an add-on to approach #1 by evicting the entire
hash table (writing out the partial states), then resetting the memory
context.

It does add to the complexity, though, and would only work for
aggregates that support serializing and combining partial states. It
also might be a net loss to do the extra work of initializing and
evicting a partial state if we don't have large enough groups to
benefit.

Given that the worst case isn't worse than Sort+GroupAgg, I think it
should be left as a future optimization. That would give us time to
tune the process to work well in a variety of cases.

Regards,
    Jeff Davis
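
PS: For concreteness, a back-of-the-envelope sketch of the spill
volumes in that worst case. The group counts are the ones from the
example above; the assumption that the small groups fit exactly in
memory is just illustrative:

    # Illustrative only: compare spilled-tuple counts for the worst case above.
    # Assumes memory holds exactly the 10 million small group entries and each
    # input tuple is spilled at most once.

    small_groups = 10_000_000      # groups of size one, seen first; they fill memory
    big_group_tuples = 20_000_000  # one skewed group arriving after memory is full

    # HashAgg with spilling: the small groups stay in memory; only the tuples
    # of the late, skewed group have to be written out.
    hashagg_spilled = big_group_tuples

    # Sort+GroupAgg: every input tuple has to be written out to be sorted.
    sort_spilled = small_groups + big_group_tuples

    print(f"HashAgg spills       {hashagg_spilled:,} tuples")
    print(f"Sort+GroupAgg spills {sort_spilled:,} tuples")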
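
And a rough sketch of the evict-everything-as-partial-states idea
described above. This is plain Python for illustration, not executor
code; trans/serialize/deserialize/combine stand in for an aggregate's
transition, serial, and combine support, and the memory limit is a
pretend one:

    # Rough sketch: when the table fills up, evict the whole hash table as
    # serialized partial states and start over with an empty table, then
    # reload and combine the spilled partials at the end.

    MAX_GROUPS = 3          # pretend memory limit: at most 3 groups in the table
    spill_file = []         # stands in for a spill tape of (key, serialized state)

    def trans(state, value):            # transition function: here, a running sum
        return (state or 0) + value

    def serialize(state):               # serialfunc stand-in
        return str(state)

    def deserialize(blob):              # deserialfunc stand-in
        return int(blob)

    def combine(a, b):                  # combinefunc stand-in
        return a + b

    def hash_agg(tuples):
        table = {}
        for key, value in tuples:
            if key not in table and len(table) >= MAX_GROUPS:
                # Table is full: evict *everything* as partial states and
                # "reset the memory context" by starting with an empty table.
                spill_file.extend((k, serialize(s)) for k, s in table.items())
                table = {}
            table[key] = trans(table.get(key), value)

        # Final pass: reload the spilled partial states and combine them with
        # whatever is still in memory.
        for key, blob in spill_file:
            partial = deserialize(blob)
            table[key] = combine(table[key], partial) if key in table else partial
        return table

    print(hash_agg([("a", 1), ("b", 1), ("c", 1), ("d", 1), ("a", 1), ("b", 1)]))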