Still working through this patch ... there are some things that bother me about the entries being made in pg_statistic:
1. You re-used STATISTIC_KIND_MCELEM for something that, while similar to tsvector's usage, is not the same. In particular, tsvector adds two extra elements to the stanumbers array, but this patch adds four (and doesn't bother documenting that in pg_statistic.h). I don't think it's acceptable to re-use the same stakind value for something that's not following the same specification. I see Nathan complained of this way, way upthread, but nothing was done about it. I think we should either assign a different stakind code for this definition, or change things around so that tsvector actually is using the same stats kind definition as arrays are. (We could get away with redefining the tsvector stats format, because pg_upgrade doesn't try to copy pg_statistic rows to the new database.) Now, of the two new elements added by the patch, it seems to me to be perfectly reasonable to add a null-element frequency to the kind specification; the fact that it wasn't there to start with is kind of an oversight born of the fact that tsvectors don't contain any null lexemes. But the other new element is the average distinct element count, which really does not belong here at all, as it is *entirely* unrelated to element frequencies. It seems to me that that more nearly belongs in the element-count histogram slot. So my preference is to align the two definitions of STATISTIC_KIND_MCELEM by adding a null-element frequency to tsvector's usage (where it'll always be zero) and getting rid of the average distinct element count here. 2. I think STATISTIC_KIND_LENGTH_HISTOGRAM is badly named and confusingly documented. The stats are not about anything I would call a "length" --- rather we're considering the counts of numbers of distinct element values present in each array value. An ideal name perhaps would be STATISTIC_KIND_DISTINCT_ELEMENTS_COUNT_HISTOGRAM, but of course that's unreasonably long. Considering the way that the existing stats kind names are abbreviated, maybe STATISTIC_KIND_DECHIST would do. Anybody have a better idea? 3. I also find it a bit odd that you chose to store the length (count) histogram as an integer array in stavalues. Usually we've put such data in stanumbers. That would make the entries float4 not integer, but that doesn't seem unreasonable to me --- values would still be exact up to 2^24 or so on typical machines, and if we ever do have values larger than that, it seems to me that having headroom to go above 2^32 would be a good thing. In any case, if we're going to move the average distinct-element count over here, that would have to go into stanumbers. Comments? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers