Hello,

When calculating the disk costs of hash aggregation that spills to disk,
there is something wrong with how we determine depth:

>            depth = ceil( log(nbatches - 1) / log(num_partitions) );

If nbatches is some number between 1.0 and 2.0, log(nbatches - 1) is
negative, so depth can come out negative. As a result, we may end up with
a negative cost for the hash aggregation plan node, as described in [1].

I don't think 'log(nbatches - 1)' is what we want here. Should it be
just '(nbatches - 1)'?
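
To make this concrete, here is a small standalone sketch (not the actual
costsize.c code; the nbatches and num_partitions values below are made up
for illustration) that evaluates both the current expression and the
'(nbatches - 1)' variant for an nbatches value just above 1.0:

    /* Standalone sketch only -- not the actual costsize.c code. */
    #include <math.h>
    #include <stdio.h>

    int
    main(void)
    {
        double  nbatches = 1.1;        /* hypothetical: just above 1.0 */
        double  num_partitions = 4.0;  /* hypothetical fan-out */

        /* current expression, as quoted above */
        double  depth_current = ceil(log(nbatches - 1) / log(num_partitions));

        /* the '(nbatches - 1)' variant asked about above */
        double  depth_variant = ceil((nbatches - 1) / log(num_partitions));

        /* log(0.1) / log(4.0) = -1.66..., so the current expression gives -1 */
        printf("current: %g, variant: %g\n", depth_current, depth_variant);

        return 0;
    }

With these inputs the current expression comes out as -1, while the
'(nbatches - 1)' variant comes out as 1.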

[1]
https://www.postgresql.org/message-id/flat/CAMbWs4_maqdBnRR4x01pDpoV-CiQ%2BRvMQaPm4JoTPbA%3DmZmhMw%40mail.gmail.com

Thanks
Richard

On Thu, Mar 19, 2020 at 7:36 AM Jeff Davis <pg...@j-davis.com> wrote:

>
> Committed.
>
> There's some future work that would be nice (some of these are just
> ideas and may not be worth it):
>
> * Refactor MemoryContextMemAllocated() to be a part of
> MemoryContextStats(), but allow it to avoid walking through the blocks
> and freelists.
>
> * Improve the choice of the initial number of buckets in the hash
> table. For this patch, I tried to preserve the existing behavior of
> estimating the number of groups and trying to initialize with that many
> buckets. But my performance tests seem to indicate this is not the best
> approach. More work is needed to find what we should really do here.
>
> * For workloads that are not in work_mem *or* system memory, and need
> to actually go to storage, I see poor CPU utilization because it's not
> effectively overlapping CPU and IO work. Perhaps buffering or readahead
> changes can improve this, or async IO (even better).
>
> * Project unnecessary attributes away before spilling tuples to disk.
>
> * Improve logtape.c API so that the caller doesn't need to manage a
> bunch of tape numbers.
>
> * Improve estimate of the hash entry size. This patch doesn't change
> the way the planner estimates it, but I observe that actual size as
> seen at runtime is significantly different. This is connected to the
> initial number of buckets for the hash table.
>
> * In recursive steps, I don't have a good estimate for the number of
> groups, so I just estimate it as the number of tuples in that spill
> tape (which is pessimistic). That could be improved by doing a real
> cardinality estimate as the tuples are spilling (perhaps with HLL?).
>
> * Many aggregates with pass-by-ref transition states don't provide a
> great aggtransspace. We should consider doing something smarter, like
> having negative numbers represent a number that should be multiplied by
> the size of the group (e.g. ARRAY_AGG would have a size dependent on
> the group size, not a constant).
>
> * If we want to handle ARRAY_AGG (and the like) well, we can consider
> spilling the partial states in the hash table when the memory is full.
> That would add a fair amount of complexity because there would be two
> types of spilled data (tuples and partial states), but it could be
> useful in some cases.
>
> Regards,
>         Jeff Davis
>
