Re: [HACKERS] PATCH: multivariate histograms and MCV lists

Tomas Vondra Sat, 16 Mar 2019 16:46:06 -0700


On 3/16/19 10:26 PM, Dean Rasheed wrote:
> On Fri, 15 Mar 2019 at 00:06, Tomas Vondra <tomas.von...@2ndquadrant.com> 
> wrote:
>> ... attached patch ...
> 
> Some more review comments, carrying on from where I left off:
> 
> 16). This regression test fails for me:
> 
> @@ -654,11 +654,11 @@
>  -- check change of unrelated column type does not reset the MCV statistics
>  ALTER TABLE mcv_lists ALTER COLUMN d TYPE VARCHAR(64);
>  SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a =
> 1 AND b = ''1''');
>   estimated | actual
>  -----------+--------
> -        50 |     50
> +        11 |     50
>  (1 row)
> 
> Maybe that's platform-dependent, given what you said about
> reltuples/relpages being reset. An easy workaround for this would be
> to modify this test (and perhaps the one that follows) to just query
> pg_statistic_ext to see if the MCV statistics have been reset.
>


Ah, sorry for not explaining this bit - the failure is expected, due to
the reset of relpages/reltuples I mentioned. We do keep the extended
stats, but the relsize estimate changes a bit. It surprised me a bit,
and this test made the behavior apparent. The last patchset included a
piece that changes that - if we decide not to change this, I think we
can simply accept the actual output.

> 17). I'm definitely preferring the new style of tests because they're
> much neater and easier to read, and to directly see the effect of the
> extended statistics. One thing I'd consider adding is a query of
> pg_statistic_ext using pg_mcv_list_items() after creating the MCV
> stats, both to test that function, and to show that the MCV lists have
> the expected contents (provided that output isn't too large).
> 

OK, will do.

> 18). Spurious whitespace added to src/backend/statistics/mvdistinct.c.
> 

fixed

> 19). In the function comment for statext_mcv_clauselist_selectivity(),
> the name at the top doesn't match the new function name. Also, I think
> it should mention MCV in the initial description. I.e., instead of
> 
> +/*
> + * mcv_clauselist_selectivity
> + *        Estimate clauses using the best multi-column statistics.
> 
> it should say:
> 
> +/*
> + * statext_mcv_clauselist_selectivity
> + *        Estimate clauses using the best multi-column MCV statistics.
> 

fixed

> 20). Later in the same comment, this part should now be deleted:
> 
> + *
> + * So (simple_selectivity - base_selectivity) may be seen as a correction for
> + * the part not covered by the MCV list.
> 

fixed

> 21). For consistency with other bms_ functions, I think the name of
> the Bitmapset argument for bms_member_index() should just be called
> "a". Nitpicking, I'd also put bms_member_index() immediately after
> bms_is_member() in the source, to match the header.
> 

I think I've already done the renames in the last patch I submitted (are
you looking at an older version of the code, perhaps?). I've moved it
right after bms_is_member - good idea.

> 22). mcv_get_match_bitmap() should really use an array of bool rather
> than an array of char. Note that a bool is guaranteed to be of size 1,
> so it won't make things any less efficient, but it will allow some
> code to be made neater. E.g., all clauses like "matches[i] == false"
> and "matches[i] != false" can just be made "!matches[i]" or
> "matches[i]". Also the Min/Max expressions on those match flags can be
> replaced with the logical operators && and ||.
> 

fixed

> 23). Looking at this code in statext_mcv_build():
> 
>         /* store info about data type OIDs */
>         i = 0;
>         j = -1;
>         while ((j = bms_next_member(attrs, j)) >= 0)
>         {
>             VacAttrStats *colstat = stats[i];
> 
>             mcvlist->types[i] = colstat->attrtypid;
>             i++;
>         }
> 
> it isn't actually making use of the attribute numbers (j) from attrs,
> so this could be simplified to:
> 
>         /* store info about data type OIDs */
>         for (i = 0; i < numattrs; i++)
>             mcvlist->types[i] = stats[i]->attrtypid;
> 

yep, fixed

> 24). Later in that function, the following comment doesn't appear to
> make sense. Is this possibly from an earlier version of the code?
> 
>             /* copy values from the _previous_ group (last item of) */
> 

yep, seems like a residue from an older version, fixed

> 25). As for (23), in build_mss(), the loop over the Bitmapset of
> attributes never actually uses the attribute numbers (j), so that
> could just be a loop from i=0 to numattrs-1, and then that function
> doesn't need to be passed the Bitmapset at all -- it could just be
> passed the integer numattrs.
> 

fixed

> 26). build_distinct_groups() looks like it makes an implicit
> assumption that the counts of the items passed in are all zero. That
> is indeed the case, if they've come from build_sorted_items(), because
> that does a palloc0(), but that feels a little fragile. I think it
> would be better if build_distinct_groups() explicitly set the count
> each time it detects a new group.
> 

good idea, fixed

> 27). In statext_mcv_serialize(), the TODO comment says
> 
>  * TODO: Consider packing boolean flags (NULL) for each item into a single 
> char
>  * (or a longer type) instead of using an array of bool items.
> 
> A more efficient way to save space might be to do away with the
> boolean null flags entirely, and just use a special index value like
> 0xffff to signify a NULL value.
> 

Hmmm, maybe. I think there's a room for improvement.

> 28). I just spotted the 1MB limit on the serialised MCV list size. I
> think this is going to be too limiting. For example, if the stats
> target is at its maximum of 10000, that only leaves around 100 bytes
> for each item's values, which is easily exceeded. In fact, I think
> this approach for limiting the MCV list size isn't a good one --
> consider what would happen if there were lots of very large values.
> Would it run out of memory before getting to that test? Even if not,
> it would likely take an excessive amount of time.
> 

True. I don't have a very good argument for a specific value, or even
having an explicit limit at all. I've initially added it mostly as a
safety for development purposes, but I think you're right we can just
get rid of it. I don't think it'd run out of memory before hitting the
limit, but I haven't tried very hard (but I recall running into the 1MB
limit in the past).

> I think this part of the patch needs a bit of a rethink. My first
> thought is to do something similar to what happens for per-column
> MCVs, and set an upper limit on the size of each value that is ever
> considered for inclusion in the stats (c.f. WIDTH_THRESHOLD and
> toowide_cnt in analyse.c). Over-wide values should be excluded early
> on, and it will need to track whether or not any such values were
> excluded, because then it wouldn't be appropriate to treat the stats
> as complete and keep the entire list, without calling
> get_mincount_for_mcv_list().
> 
Which part? Serialization / deserialization? Or how we handle long
values when building the MCV list?

cheers

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

0001-retain-reltuples-relpages-on-ALTER-TABLE-20190317.patch.gz
Description: application/gzip

0002-multivariate-MCV-lists-20190317.patch.gz
Description: application/gzip

0003-multivariate-histograms-20190317.patch.gz
Description: application/gzip

Re: [HACKERS] PATCH: multivariate histograms and MCV lists

Reply via email to