Hello,

When calculating the disk costs of hash aggregation that spills to disk, there is something wrong with how we determine the depth:
> depth = ceil( log(nbatches - 1) / log(num_partitions) );

If nbatches is some number between 1.0 and 2.0, then (nbatches - 1) is below
1.0, its logarithm is negative, and we end up with a negative depth. As a
result, we may get a negative cost for the hash aggregation plan node, as
described in [1].

I don't think 'log(nbatches - 1)' is what we want here. Should it be just
'(nbatches - 1)'? (A small numeric sketch of how the depth goes negative is
attached after the quoted message below.)

[1] https://www.postgresql.org/message-id/flat/CAMbWs4_maqdBnRR4x01pDpoV-CiQ%2BRvMQaPm4JoTPbA%3DmZmhMw%40mail.gmail.com

Thanks
Richard

On Thu, Mar 19, 2020 at 7:36 AM Jeff Davis <pg...@j-davis.com> wrote:
>
> Committed.
>
> There's some future work that would be nice (some of these are just
> ideas and may not be worth it):
>
> * Refactor MemoryContextMemAllocated() to be a part of
> MemoryContextStats(), but allow it to avoid walking through the blocks
> and freelists.
>
> * Improve the choice of the initial number of buckets in the hash
> table. For this patch, I tried to preserve the existing behavior of
> estimating the number of groups and trying to initialize with that many
> buckets. But my performance tests seem to indicate this is not the best
> approach. More work is needed to find what we should really do here.
>
> * For workloads that do not fit in work_mem *or* system memory, and
> need to actually go to storage, I see poor CPU utilization because the
> CPU and IO work are not effectively overlapped. Perhaps buffering or
> readahead changes can improve this, or async IO (even better).
>
> * Project unnecessary attributes away before spilling tuples to disk.
>
> * Improve the logtape.c API so that the caller doesn't need to manage a
> bunch of tape numbers.
>
> * Improve the estimate of the hash entry size. This patch doesn't
> change the way the planner estimates it, but I observe that the actual
> size as seen at runtime is significantly different. This is connected
> to the initial number of buckets for the hash table.
>
> * In recursive steps, I don't have a good estimate for the number of
> groups, so I just estimate it as the number of tuples in that spill
> tape (which is pessimistic). That could be improved by doing a real
> cardinality estimate as the tuples are spilling (perhaps with HLL?).
>
> * Many aggregates with pass-by-ref transition states don't provide a
> great aggtransspace. We should consider doing something smarter, like
> having negative numbers represent a number that should be multiplied by
> the size of the group (e.g. ARRAY_AGG would have a size dependent on
> the group size, not a constant).
>
> * If we want to handle ARRAY_AGG (and the like) well, we can consider
> spilling the partial states in the hash table when memory is full.
> That would add a fair amount of complexity because there would be two
> types of spilled data (tuples and partial states), but it could be
> useful in some cases.
>
> Regards,
> Jeff Davis
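P.S. Here is the sketch referred to above. It is a standalone C snippet, not
the actual costsize.c code; the variable values are chosen only to illustrate
the problem, and the clamp at the end is just one possible workaround, not a
proposed fix.

    #include <math.h>
    #include <stdio.h>

    int
    main(void)
    {
        double      nbatches = 1.1;        /* planner estimate between 1.0 and 2.0 */
        double      num_partitions = 4.0;
        double      depth;

        /* current formula: log(0.1) is negative, so depth comes out negative */
        depth = ceil(log(nbatches - 1) / log(num_partitions));
        printf("depth = %g\n", depth);     /* prints -1 */

        /* clamping at zero would at least avoid a negative disk-cost term */
        depth = fmax(0.0, depth);
        printf("clamped depth = %g\n", depth);

        return 0;
    }

With nbatches = 1.1 and num_partitions = 4, log(0.1) / log(4) is about -1.66,
so ceil() returns -1 and the disk cost contribution for the spill passes goes
negative.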