On Mon, Jun 11, 2018 at 1:50 PM, Jeff Davis <pg...@j-davis.com> wrote: > I think average group size is the wrong way to look at it, because > averages are only useful for a normal distribution. The two most > interesting cases are: > > 1. Fairly uniform group size (e.g. normal dist) > 2. Skew values, power law distribution, etc., where some groups are > very large and most are tiny by comparison. I am calling the very large > groups "skewed groups".
I wasn't using the term "average" in a mathematical sense. I suppose I really meant "typical". I agree that thinking about cases where the group size is nonuniform is a good idea, but I don't think I agree that all uniform distributions are created equal. Uniform distributions with 1 row per group are very different than uniform distributions with 1000 rows per group. > If we get the skewed groups into the hash table, and avoid OOM, I think > that is a success. My patch did that, except it didn't account for two > cases: > > (a) clustered input, where skewed groups may not appear until the > hash table is already full; and > (b) variable-sized transition values that grow with the group size I think that many of the algorithms under consideration could be implemented without worrying about variable-sized transition values, and then the approach could be extended later. However, whether or not a given algorithm can handle clustered input seems like a fairly basic property of the algorithm. I don't think we should set the bar too high because no algorithm is going to be perfect in every case; at the same time, clustered input is pretty common in real-world scenarios, and an algorithm that handles such cases better has a significant leg up over one that can't, all other things being equal. I'm not sure I remember precisely what your proposal was any more. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company