Committed.
There's some future work that would be nice (some of these are just ideas and may not be worth it):

* Refactor MemoryContextMemAllocated() to be a part of MemoryContextStats(), but allow it to avoid walking through the blocks and freelists.

* Improve the choice of the initial number of buckets in the hash table. For this patch, I tried to preserve the existing behavior of estimating the number of groups and trying to initialize with that many buckets. But my performance tests seem to indicate this is not the best approach. More work is needed to find what we should really do here.

* For workloads that fit in neither work_mem nor system memory, and need to actually go to storage, I see poor CPU utilization because it's not effectively overlapping CPU and IO work. Perhaps buffering or readahead changes can improve this, or async IO (even better).

* Project unnecessary attributes away before spilling tuples to disk.

* Improve the logtape.c API so that the caller doesn't need to manage a bunch of tape numbers.

* Improve the estimate of the hash entry size. This patch doesn't change the way the planner estimates it, but I observe that the actual size as seen at runtime is significantly different. This is connected to the initial number of buckets for the hash table.

* In recursive steps, I don't have a good estimate for the number of groups, so I just estimate it as the number of tuples in that spill tape (which is pessimistic). That could be improved by doing a real cardinality estimate as the tuples are spilling (perhaps with HLL?).

* Many aggregates with pass-by-ref transition states don't provide a great aggtransspace. We should consider doing something smarter, like having negative numbers represent a value that should be multiplied by the size of the group (e.g. ARRAY_AGG would have a size dependent on the group size, not a constant).

* If we want to handle ARRAY_AGG (and the like) well, we can consider spilling the partial states in the hash table when memory is full.
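To make the HLL idea in the recursive-steps item concrete, here is a toy sketch of maintaining an approximate distinct-group count while tuples stream out to a spill tape, rather than pessimistically counting every spilled tuple as its own group. All names here are hypothetical illustration (Python for brevity); a real implementation would presumably reuse something like PostgreSQL's existing hyperloglog.c.

```python
# Toy HyperLogLog-style cardinality sketch (illustrative only; not the
# patch's code). Each spilled tuple's group key is fed to add(); at the
# end of the spill pass, estimate() gives an approximate number of
# distinct groups to use for planning the recursive step.

import hashlib
import math

class ToyHLL:
    def __init__(self, b=10):
        self.b = b                  # use 2^b registers
        self.m = 1 << b
        self.registers = [0] * self.m

    def add(self, group_key):
        # 64-bit hash of the group key.
        h = int.from_bytes(
            hashlib.sha256(repr(group_key).encode()).digest()[:8], "big")
        idx = h & (self.m - 1)      # low b bits pick a register
        rest = h >> self.b          # remaining (64 - b) bits
        # rank = position of the first set bit, counting from the top.
        rank = (64 - self.b) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self):
        # Standard HLL estimator with alpha_m bias correction.
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        if raw <= 2.5 * self.m:     # small-range (linear counting) correction
            zeros = self.registers.count(0)
            if zeros:
                raw = self.m * math.log(self.m / zeros)
        return raw

# 100k "spilled tuples" drawn from 5k distinct groups: the sketch
# estimates roughly 5000, instead of the pessimistic 100000.
hll = ToyHLL()
for i in range(100_000):
    hll.add(i % 5_000)
print(round(hll.estimate()))
```

With 2^10 registers the standard error is about 3%, and the memory cost is a small fixed array, which is the appeal here: the estimate comes almost for free during the spill pass.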
Spilling partial states that way would add a fair amount of complexity, because there would be two types of spilled data (tuples and partial states), but it could be useful in some cases.

Regards,
Jeff Davis