On Mon, Mar 26, 2012 at 5:11 PM, Tom Lane <t...@sss.pgh.pa.us> wrote: > Greg Stark <st...@mit.edu> writes: >> On Mon, Mar 26, 2012 at 6:15 PM, Tom Lane <t...@sss.pgh.pa.us> wrote: >>> Could you give us a brain dump on the sketch? I've never seen how to >>> do it without unreasonable overhead. > >> Hm. So my original plan was dependent on adding the state-merge >> function we've talked about in the past. Not all aggregate functions >> necessarily can support such a function but I think all or nearly all >> the builtin aggregates can. Certainly min,max, count, sum, avg, >> stddev, array_agg can which are most of what people do. That would be >> a function which can take two state variables and produce a new state >> variable. > > I'd rather not invent new requirements for aggregate implementations > if we can avoid it. > >> However now that I've started thinking about it further I think you >> could solve it with less complexity by cheating in various ways. For >> example if you limit the hash size to 1/2 of work_mem then you when >> you reach that limit you could just stuff any tuple that doesn't match >> a hash entry into a tuplesort with 1/2 of work_mem and do the regular >> level break logic on the output of that. > > Or just start dumping such tuples into a tuplestore, while continuing to > process tuples that match the hashagg entries that are already in > existence. Once the input is exhausted, read out the hashagg entries we > have, flush the hashagg table, start reading from the tuplestore. > Repeat as needed. > > I like this idea because the only thing you give up is predictability of > the order of output of aggregated entries, which is something that a > hashagg isn't guaranteeing anyway. In particular, we would still have a > guarantee that any one aggregate evaluation processes the matching > tuples in arrival order, which is critical for some aggregates. > > The main problem I can see is that if we start to flush after work_mem > is X% full, we're essentially hoping that the state values for the > existing aggregates won't grow by more than 1-X%, which is safe for many > common aggregates but fails for some like array_agg(). Ultimately, for > ones like that, it'd probably be best to never consider hashing at all. > I guess we could invent an "unsafe for hash aggregation" flag for > aggregates that have unbounded state-size requirements. >
According to what I've learned in the last couple of months, array_agg is not ready for fallback ways like dumping to tuplestore. Even merge-state is not able for them. The problem is that the executor doesn't know how to serialize/deserialize the internal type trans value. So in one implementation, the existence of merge function is a flag to switch back to sort grouping not hash aggregate and array_agg is one of such aggregate functions. That said, if you invent a new flag to note the aggregate is not dump-ready, it'd be worth inventing state merge function to aggregate infrastructure anyway. So I can imagine a way without state-merge function nor dumping to tuplestore would be to sort hash table content the rest of inputs so that we can switch to sort grouping. Since we have hash table, we can definitely sort them in memory, and we can put to disk everything that comes later than the fallback and read it after the scan finishes. Now we have sorted state values and sorted input, we can continue the rest of work. Anyway the memory usage problem is not only array_agg and hash aggregate. Even if you can say the hash table exceeds X% of the work_mem, how can you tell other operators such like Sort are not using the rest of memory? One approach I could see to avoid this is assigning arbitrary amount of memory to each operator from work_mem and calculate it locally. But this approach is going to skew occasionally and not perfect, either. Thanks, -- Hitoshi Harada -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers