Re: [HACKERS] PATCH: multivariate histograms and MCV lists

Tomas Vondra Wed, 16 Jan 2019 17:20:17 -0800

On 1/16/19 7:56 AM, David Rowley wrote:> On Tue, 15 Jan 2019 at 08:21,Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:

>> Turns out you were right - the attribute_referenced piece was quite
>> unnecessary. So I've removed it. I've also extended the regression tests
>> to verify changing type of another column does not reset the stats.
>
> (Trying to find my feet over here)
>
> I've read over the entire thread, and apart from missing the last two
> emails and therefore the latest patch, I managed to read over most of
> the MCV patch. I didn't quite get to reading mcv.c and don't quite
> have the energy to take that on now.
>
Thanks for looking!

> At this stage I'm trying to get to know the patch.  I read a lot of
> discussing between you and Dean ironing out how the stats should be
> used to form selectivities.  At the time I'd not read the patch yet,
> so most of it went over my head.
>
> I did note down a few things on my read.  I've included them below.
> Hopefully, they're useful.
>
> MCV list review
>
> 1. In mvc.c there's Assert(ndistinct <= UINT16_MAX); This should be
> PG_UINT16_MAX
>
Yep. Will fix.

> 2. math.h should be included just after postgres.h
>
Yep. Will fix.

> 3. Copyright is still -2017 in mcv.c.  Hopefully, if you change it to
> 2019, you'll never have to bump it ever again! :-)
>
Optimist ;-)

> 4. Looking at pg_stats_ext_mcvlist_items() I see you've coded the
> string building manually. The way it's coded I'm finding a little
> strange.  It means the copying becomes quadratic due to
>
> snprintf(buff, 1024, format, values[1], DatumGetPointer(valout));
> strncpy(values[1], buff, 1023);
>
> So basically, generally, here you're building a new string with
> values[1] followed by a comma, then followed by valout. One the next
> line you then copy that new buffer back into values[1].  I understand
> this part is likely not performance critical, but I see no reason to
> write the code this way.
>
> Are you limiting the strings to 1024 bytes on purpose?  I don't see
> any comment mentioning you want to truncate strings.
>
> Would it not be better to do this part using a
> AppendStringInfoString()? and just manually add a '{', ',' or '}' as
> and when required?
>> DatumGetPointer(valout) should really be using DatumGetCString(valout).
>
> Likely you can also use heap_form_tuple.  This will save you having to
> convert ints into strings then only to have BuildTupleFromCStrings()
> do the reverse.
>

I agree. I admit all of this is a residue of an initial hackish versionof the function, and should be changed to StringInfo. Will fix.


> 5. individiaul -> individual

> lists. This allows very accurate estimates for individiaulcolumns, but

>
>      litst -> lists
>

> litst on combinations of columns. Similarly to functionaldependencies

>
Will fix.

> 6. Worth mentioning planning cycles too?
>

> "It's advisable to create <literal>MCV</literal> statisticsobjects only> on combinations of columns that are actually used in conditionstogether,> and for which misestimation of the number of groups isresulting in bad> plans. Otherwise, the <command>ANALYZE</command> cycles arejust wasted."

Makes sense. Although that's what we say about the existing stats, soperhaps we should tweak that too.


> 7. straight-forward -> straightforward
>

> (most-common values) lists, a straight-forward extension of theper-column

> > 8. adresses -> addresses
>
> statistics adresses the limitation by storing individual values, but it
>
Will fix. Thanks for proof-reading.

> 9. Worth mentioning ANALYZE time?
>

> This section introduces multivariate variant of<acronym>MCV</acronym>> (most-common values) lists, a straight-forward extension of theper-column> statistics described in <xreflinkend="row-estimation-examples"/>. This> statistics adresses the limitation by storing individual values,but it> is naturally more expensive, both in terms of storage andplanning time.

>
Yeah.

> 10. low -> a low
>

> with low number of distinct values. Before looking at the secondquery,

>
> 11. them -> then
>

> on items in the <acronym>MCV</acronym> list, and them sums thefrequencies

>
Will fix.

> 12.  Should we be referencing the source from the docs?
>
> See <function>mcv_clauselist_selectivity</function>
>      in <filename>src/backend/statistics/mcv.c</filename> for details.
>
> hmm. I see it's not the first going by: git grep -E "\w+\.c\<"
> gt

Hmm, that does not return anything to me - do you actually see anyreferences to .c files in the sgml docs? I agree that probably is not agood idea, so I'll remove that.


> 13. Pretty minor, but the following loop in
> UpdateStatisticsForTypeChange() could use a break;
>
> attribute_referenced = false;
> for (i = 0; i < staForm->stxkeys.dim1; i++)
> if (attnum == staForm->stxkeys.values[i])
> attribute_referenced = true;
>
> UPDATE: If I'd reviewed the correct patch I'd have seen that you'd
> removed this already
>
;-)

> 14. Again in UpdateStatisticsForTypeChange(), would it not be better
> to do the statext_is_kind_built(oldtup, STATS_EXT_MCV) check before
> checking if the stats contain this column?  This gets rid of your
> reset_stats variable.
>
> I also don't quite understand why there's an additional check for
> statext_is_kind_built(oldtup, STATS_EXT_MCV), which if that's false
> then why do we do the dummy update on the tuple?
>
> Have you just coded this so that you can support other stats types
> later without too much modification? If so, I'm finding it a bit
> confusing to read, so maybe it's worth only coding it that way if
> there's more than one stats type to reset for.
>
> UPDATE: If I'd reviewed the correct patch I'd have seen that you'd
> removed this already
;-)

>
> 15. I see you broke out the remainder of the code from
> clauselist_selectivity() into clauselist_selectivity_simple().  The
> comment looks like just a copy and paste from the original.  That
> seems like quite a bit of duplication. Is it better to maybe trim down
> the original one?
>
I'll see what I can do.

> 16. I initially didn't see how this code transformed the bms into anarray:

>
> /*
> * Transform the bms into an array, to make accessing i-th member easier,
> * and then construct a filtered version with only attnums referenced
> * by the dependency we validate.
> */
> attnums = build_attnums(attrs);
>
> attnums_dep = (int *)palloc(k * sizeof(int));
> for (i = 0; i < k; i++)
> attnums_dep[i] = attnums[dependency[i]];
>
> Would it be better to name build_attnums() build_attnums_array() ?
>
> I think it would also be better to, instead of saying "the bms", just
> say "attrs".
>
Hmmm, maybe.

> 17. dependencies_clauselist_selectivity(), in:
>
> if ((dependency_is_compatible_clause(clause, rel->relid, &attnum)) &&
> (!bms_is_member(listidx, *estimatedclauses)))
>
> would it be better to have the bms_is_member() first?
>
Yes, that might be a tad faster.

> 18. In dependencies_clauselist_selectivity() there seem to be a new
> bug introduced.  We do:
>
> /* mark this one as done, so we don't touch it again. */
> *estimatedclauses = bms_add_member(*estimatedclauses, listidx);
>
> but the bms_is_member() check that skipped these has been removed.
>
> It might be easier to document if we just always do:
>
>    if (bms_is_member(listidx, *estimatedclauses))
> continue;
>
> at the start of both loops. list_attnums can just be left unset for
> the originally already estimatedclauses.
>

It's probably not as clear as it should be, but if the clause is alreadyestimated (or incompatible), then the list_attnums[] entry will beInvalidAttrNumber. Which is what we check in the second loop.


> 19. in extended_stats.c, should build_attnums() be documented that the
> Bitmapset members are not offset by
> FirstLowInvalidHeapAttributeNumber. I think mostly Bitmapsets of
> Attnums are offset by this, so might be worth a mention.
>
Good point.

> 20. I think bms_member_index() needs documentation.  I imagine you'll
> want to mention that the bitmapset must contain the given varattno,
> else surely it'll do the wrong thing if it's not. Perhaps an
> Assert(bms_is_member(keys, varattno)); should be added to it.
>

Agreed. Or maybe make it return -1 in that case? It might even havemissing_ok flag or something like that.


> 21. Comment does not really explain what the function does or what the
> arguments mean:
>
> /*
>   * statext_is_compatible_clause_internal
>   * Does the heavy lifting of actually inspecting the clauses for
>   * statext_is_compatible_clause.
>   */
>
Will improve.

> 22. In statext_is_compatible_clause_internal():
>
> /* Var = Const */
>
> The above comment seems a bit misplaced.  It looks like the code below
> it is looking for an OpExpr in the form of "Var <op> Const", or "Const
> <op> Var".
>
Yes, I agree.

> 23. statext_is_compatible_clause_internal() you have:
>
> if ((get_oprrest(expr->opno) != F_EQSEL) &&
> (get_oprrest(expr->opno) != F_NEQSEL) &&
> (get_oprrest(expr->opno) != F_SCALARLTSEL) &&
> (get_oprrest(expr->opno) != F_SCALARLESEL) &&
> (get_oprrest(expr->opno) != F_SCALARGTSEL) &&
> (get_oprrest(expr->opno) != F_SCALARGESEL))
> return false;
>
> 6 calls to get_oprrest(). 1 is enough.
>
> How does the existing MCV and histogram stats handle these operators?
> Does it insist on a btree opfamily, or is it as crude as this too?
>
It's this crude too, AFAICS.

> 24. In statext_is_compatible_clause_internal, you have:
>
> /* NOT/AND/OR clause */
> if (or_clause(clause) ||
> and_clause(clause) ||
> not_clause(clause))
> {
> /*
> * AND/OR/NOT-clauses are supported if all sub-clauses are supported
>
> Looks like you were not sure which order to have these, so you just
> tried a few variations :-D Maybe just make them all the same?
>
If you insist ;-)

> 25. Does statext_is_compatible_clause_internal)_ need to skip overRelabelTypes?

I believe it does, based on what I've observed during development. Whydo you think it's not necessary?


> 26. In statext_is_compatible_clause_internal() you mention: /* We only
> support plain Vars for now */, but I see nothing that ensures that
> only Vars are allowed in the is_opclause() condition.
>
> /* see if it actually has the right */
> ok = (NumRelids((Node *) expr) == 1) &&
> (is_pseudo_constant_clause(lsecond(expr->args)) ||
> (varonleft = false,
>    is_pseudo_constant_clause(linitial(expr->args))));
>
> the above would allow var+var == const through.
>

But then we call statext_is_compatible_clause_internal on it again, andthat only allows Vars and "Var op Const" expressions. Maybe there's away around that?


> The NumRelids seems like it would never have anything > 1 as you have
> a BMS_SINGLETON test on the RestrictInfo where you're calling this
> function from.  I think you likely want just a IsA(... , Var) checks
> here, after skipping over RelabelTypes.
> > Not sure what "/* see if it actually has the right */" means.
>
That should have been "right structure" I believe.

> 27. Should the function be named something more related to MCV?  The
> name makes it appear fairly generic to extended stats.
>
>   * statext_is_compatible_clause
>   * Determines if the clause is compatible with MCV lists.
>

No, because it's supposed to also handle histograms (and perhaps otherstats types) in the future.


> 28. This comment seems wrong:
>

> * Currently we only support Var = Const, or Const = Var. It may bepossible

>   * to expand on this later.
>
> I see you're allowing IS NULL and IS NOT NULL too.  = does not seem to
> be required either.
>
OK, will fix.

> 29. The following fragment makes me think we're only processing
> clauses to use them with MCV lists, but the comment claims "dependency
> selectivity estimations"
>
> /* we're interested in MCV lists */
> int types = STATS_EXT_MCV;
>
> /* check if there's any stats that might be useful for us. */
> if (!has_stats_of_kind(rel->statlist, types))
> return (Selectivity) 1.0;
>
> list_attnums = (Bitmapset **) palloc(sizeof(Bitmapset *) *
> list_length(clauses));
>
> /*
> * Pre-process the clauses list to extract the attnums seen in each item.
> * We need to determine if there's any clauses which will be useful for
> * dependency selectivity estimations. Along the way we'll record all of
>
Yeah, that's copy-pasto.

> 30. Is it better to do the bms_is_member() first here?
>
> if ((statext_is_compatible_clause(clause, rel->relid, &attnums)) &&
> (!bms_is_member(listidx, *estimatedclauses)))
>
> Likely it'll be cheaper.
>
Yeah, same as before.

> 31. I think this comment should be /* Ensure choose_best_statistics()
> didn't mess up */
>
> /* We only understand MCV lists for now. */
> Assert(stat->kind == STATS_EXT_MCV);
>
I'll expand the comment a bit.

> 32. What're lags?
>
> bool    *isnull; /* lags of NULL values (up to 32 columns) */
>
Should be "flags" I think.

> 33. "ndimentions"? There's no field in the struct by that name. I'd
> assume it's the same size as the isnull array above it?
>
> Datum    *values; /* variable-length (ndimensions) */
>
Yes, that's the case.

> 34. README.mcv
>
> * large -> a large
>

> For columns with large number of distinct values (e.g. those withcontinuous

>
> * Is the following up-to-date?  I thought I saw code for NOT too?
>
>      (a) equality clauses    WHERE (a = 1) AND (b = 2)
>      (b) inequality clauses  WHERE (a < 1) AND (b >= 2)
>      (c) NULL clauses        WHERE (a IS NULL) AND (b IS NOT NULL)
>      (d) OR clauses          WHERE (a < 1) OR (b >= 2)
>
> * multi-variate -> multivariate
>

> are large the list may be quite large. This is especially true formulti-variate

>
> * a -> an
>

> TODO Currently there's no logic to consider building only a MCV list(and not

>
> * I'd have said "an SRF", but git grep "a SRF" disagrees with me. I
> guess those people must be pronouncing it, somehow!? surf... serf... ?
>
> easier, there's a SRF returning detailed information about the MCV lists.
>
> * Is it better to put a working SQL in here?
>
> SELECT * FROM pg_mcv_list_items(stxmcv);
>
> maybe like:
>
> SELECT s.* FROM pg_statistic_ext, LATERAL pg_mcv_list_items(stxmcv) s;
>
> Maybe with a WHERE clause?
>
> * This list seems outdated.
>
>      - item index (0, ..., (nitems-1))
>      - values (string array)
>      - nulls only (boolean array)
>      - frequency (double precision)
>
> base_frequency seems to exist now too.
>
Yeah, those are mostly typos. Will fix.


thanks

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [HACKERS] PATCH: multivariate histograms and MCV lists

Reply via email to