On Wed, 2018-06-13 at 10:08 -0400, Robert Haas wrote: > I wasn't using the term "average" in a mathematical sense. I suppose > I really meant "typical". I agree that thinking about cases where > the > group size is nonuniform is a good idea, but I don't think I agree > that all uniform distributions are created equal. Uniform > distributions with 1 row per group are very different than uniform > distributions with 1000 rows per group.
The only mechanism we have for dealing efficiently with skewed groups is hashing; no alternative has been proposed. I think what you are saying is that the case of medium-large groups with clustered input are kind of like skewed groups because they have enough locality to benefit from grouping before spilling. I can see that. So how do we handle both of these cases (skewed groups and clustered groups) well? I tend toward a simple approach, which is to just put whatever we can in the hash table. Once it's full, if the hit rate on the hash table (the whole table) drops below a certain threshold, we just dump all the transition states out to disk and empty the hash table. That would handle both skewed groups and clustering fairly well. Clustering would cause the hit rate to rapidly go to zero when we are past the current batch of groups, causing fairly quick switching to new groups which should handle Tomas's case[1]. And it would also handle ordinary skewed groups fairly well -- it may write them out sometimes (depending on how smart our eviction strategy is), but the cost of that is not necessarily bad because it will go back into the hash table quickly. It also offers an incremental implementation strategy: something resembling my patch could be first (which doesn't dump transition states at all), followed by something that can dump transition states, followed by some tweaking to make it smarter. That still leaves the question about what to do with the small groups: partition them (like my patch does) or feed them into a sort and group them? Regards, Jeff Davis [1] https://www.postgresql.org/message-id/46734151-3245-54ac-76fc-17742 fb0e6d9%402ndquadrant.com