Re: [HACKERS] PATCH: multivariate histograms and MCV lists

Tomas Vondra Thu, 28 Feb 2019 11:57:29 -0800

Hi,

Attached is an updated version of this patch series. I've decided to
rebase and send both parts (MCV and histograms), although we've agreed
to focus on the MCV part for now. I don't want to leave the histogram to
lag behind, because (a) then it'd be much more work to update it, and
(b) I think it's an useful feedback about likely future changes.


This should address most of the issues pointed out by David in his
recent reviews. Briefly:

1) It fixes/updates a number of comments and docs on various places,
removes redundant comments etc. In most cases I've simply adopted the
wording proposed by David, with minor tweaks in a couple of cases.

2) Reverts changes that exposed analyze_mcv_list - this was a leftover
from the attempt to reuse the single-column algorithm, but we've since
agreed it's not the right approach. So this change is unnecessary.

3) I've tweaked the code to accept RelabelType nodes as supported,
similarly to what examine_variable() does. Previously I concluded we
can't support RelabelType, but it seems that reasoning was bogus. I've
slightly tweaked the regression tests by changing one of the columns to
varchar, so that the queries actualy trigger this.

4) I've tweaked a couple of places (UpdateStatisticsForTypeChange,
statext_clauselist_selectivity and estimate_num_groups_simple) per
David's suggestions. Those were fairly straightforward simplifications.

5) I've removed mcv_count from statext_mcv_build(). As David pointed
out, this was not actually needed - it was another remnant of the
attempt to re-use analyze_mcv_list() which needs such array. But without
it we can access the groups directly.

6) One of the review questions was about the purpose of this code:

  for (i = 0; i < nitems; i++)
  {
      if (groups[i].count < mincount)
      {
          nitems = i;
          break;
      }
  }

It's quite simple - we want to include groups with more occurrences than
mincount, and the groups are sorted by the count (in descending order).
So we simply find the first group with count below mincount, and the
index is the number of groups to keep. I've tried to explain that in a
comment.

7) I've fixed a bunch of format patters in statext_mcv_deserialize(),
particularly those that confused %d and %u. We can't however use %d for
VARSIZE_ANY_EXHDR, because that macro expands into offsetof() etc. So
that would trigger compiler warnings.

8) Yeah, pg_stats_ext_mcvlist_items was broken. The issue was that one
of the output parameters is defined as boolean[], but the function was
building just string. Originally it used BuildTupleFromCStrings(), but
then it switched to heap_form_tuple() without building a valid array.

I've decided to simply revert back to BuildTupleFromCStrings(). It's not
going to be used very frequently, so the small performance difference is
not important.

I've also fixed the formatting issues, pointed out by David.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

0001-multivariate-MCV-lists-20190228.patch.gz
Description: application/gzip

0002-multivariate-histograms-20190228.patch.gz
Description: application/gzip

Re: [HACKERS] PATCH: multivariate histograms and MCV lists

Reply via email to