On 3/16/19 10:26 PM, Dean Rasheed wrote: > On Fri, 15 Mar 2019 at 00:06, Tomas Vondra <tomas.von...@2ndquadrant.com> > wrote: >> ... attached patch ... > > Some more review comments, carrying on from where I left off: > > 16). This regression test fails for me: > > @@ -654,11 +654,11 @@ > -- check change of unrelated column type does not reset the MCV statistics > ALTER TABLE mcv_lists ALTER COLUMN d TYPE VARCHAR(64); > SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = > 1 AND b = ''1'''); > estimated | actual > -----------+-------- > - 50 | 50 > + 11 | 50 > (1 row) > > Maybe that's platform-dependent, given what you said about > reltuples/relpages being reset. An easy workaround for this would be > to modify this test (and perhaps the one that follows) to just query > pg_statistic_ext to see if the MCV statistics have been reset. >
Ah, sorry for not explaining this bit - the failure is expected, due to the reset of relpages/reltuples I mentioned. We do keep the extended stats, but the relsize estimate changes a bit. It surprised me a bit, and this test made the behavior apparent. The last patchset included a piece that changes that - if we decide not to change this, I think we can simply accept the actual output. > 17). I'm definitely preferring the new style of tests because they're > much neater and easier to read, and to directly see the effect of the > extended statistics. One thing I'd consider adding is a query of > pg_statistic_ext using pg_mcv_list_items() after creating the MCV > stats, both to test that function, and to show that the MCV lists have > the expected contents (provided that output isn't too large). > OK, will do. > 18). Spurious whitespace added to src/backend/statistics/mvdistinct.c. > fixed > 19). In the function comment for statext_mcv_clauselist_selectivity(), > the name at the top doesn't match the new function name. Also, I think > it should mention MCV in the initial description. I.e., instead of > > +/* > + * mcv_clauselist_selectivity > + * Estimate clauses using the best multi-column statistics. > > it should say: > > +/* > + * statext_mcv_clauselist_selectivity > + * Estimate clauses using the best multi-column MCV statistics. > fixed > 20). Later in the same comment, this part should now be deleted: > > + * > + * So (simple_selectivity - base_selectivity) may be seen as a correction for > + * the part not covered by the MCV list. > fixed > 21). For consistency with other bms_ functions, I think the name of > the Bitmapset argument for bms_member_index() should just be called > "a". Nitpicking, I'd also put bms_member_index() immediately after > bms_is_member() in the source, to match the header. > I think I've already done the renames in the last patch I submitted (are you looking at an older version of the code, perhaps?). I've moved it right after bms_is_member - good idea. > 22). mcv_get_match_bitmap() should really use an array of bool rather > than an array of char. Note that a bool is guaranteed to be of size 1, > so it won't make things any less efficient, but it will allow some > code to be made neater. E.g., all clauses like "matches[i] == false" > and "matches[i] != false" can just be made "!matches[i]" or > "matches[i]". Also the Min/Max expressions on those match flags can be > replaced with the logical operators && and ||. > fixed > 23). Looking at this code in statext_mcv_build(): > > /* store info about data type OIDs */ > i = 0; > j = -1; > while ((j = bms_next_member(attrs, j)) >= 0) > { > VacAttrStats *colstat = stats[i]; > > mcvlist->types[i] = colstat->attrtypid; > i++; > } > > it isn't actually making use of the attribute numbers (j) from attrs, > so this could be simplified to: > > /* store info about data type OIDs */ > for (i = 0; i < numattrs; i++) > mcvlist->types[i] = stats[i]->attrtypid; > yep, fixed > 24). Later in that function, the following comment doesn't appear to > make sense. Is this possibly from an earlier version of the code? > > /* copy values from the _previous_ group (last item of) */ > yep, seems like a residue from an older version, fixed > 25). As for (23), in build_mss(), the loop over the Bitmapset of > attributes never actually uses the attribute numbers (j), so that > could just be a loop from i=0 to numattrs-1, and then that function > doesn't need to be passed the Bitmapset at all -- it could just be > passed the integer numattrs. > fixed > 26). build_distinct_groups() looks like it makes an implicit > assumption that the counts of the items passed in are all zero. That > is indeed the case, if they've come from build_sorted_items(), because > that does a palloc0(), but that feels a little fragile. I think it > would be better if build_distinct_groups() explicitly set the count > each time it detects a new group. > good idea, fixed > 27). In statext_mcv_serialize(), the TODO comment says > > * TODO: Consider packing boolean flags (NULL) for each item into a single > char > * (or a longer type) instead of using an array of bool items. > > A more efficient way to save space might be to do away with the > boolean null flags entirely, and just use a special index value like > 0xffff to signify a NULL value. > Hmmm, maybe. I think there's a room for improvement. > 28). I just spotted the 1MB limit on the serialised MCV list size. I > think this is going to be too limiting. For example, if the stats > target is at its maximum of 10000, that only leaves around 100 bytes > for each item's values, which is easily exceeded. In fact, I think > this approach for limiting the MCV list size isn't a good one -- > consider what would happen if there were lots of very large values. > Would it run out of memory before getting to that test? Even if not, > it would likely take an excessive amount of time. > True. I don't have a very good argument for a specific value, or even having an explicit limit at all. I've initially added it mostly as a safety for development purposes, but I think you're right we can just get rid of it. I don't think it'd run out of memory before hitting the limit, but I haven't tried very hard (but I recall running into the 1MB limit in the past). > I think this part of the patch needs a bit of a rethink. My first > thought is to do something similar to what happens for per-column > MCVs, and set an upper limit on the size of each value that is ever > considered for inclusion in the stats (c.f. WIDTH_THRESHOLD and > toowide_cnt in analyse.c). Over-wide values should be excluded early > on, and it will need to track whether or not any such values were > excluded, because then it wouldn't be appropriate to treat the stats > as complete and keep the entire list, without calling > get_mincount_for_mcv_list(). > Which part? Serialization / deserialization? Or how we handle long values when building the MCV list? cheers -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
0001-retain-reltuples-relpages-on-ALTER-TABLE-20190317.patch.gz
Description: application/gzip
0002-multivariate-MCV-lists-20190317.patch.gz
Description: application/gzip
0003-multivariate-histograms-20190317.patch.gz
Description: application/gzip