Re: [HACKERS] PATCH: multivariate histograms and MCV lists

Tomas Vondra Sat, 25 Nov 2017 15:34:40 -0800


On 11/25/2017 10:01 PM, Mark Dilger wrote:
> 
>> On Nov 18, 2017, at 12:28 PM, Tomas Vondra <[email protected]> wrote:
>>
>> Hi,
>>
>> Attached is an updated version of the patch, adopting the psql describe
>> changes introduced by 471d55859c11b.
> 
> Hi Tomas,
> 
> In src/backend/statistics/dependencies.c, you have introduced a comment:
> 
> +       /*
> +        * build an array of SortItem(s) sorted using the multi-sort support
> +        *
> +        * XXX This relies on all stats entries pointing to the same tuple
> +        * descriptor. Not sure if that might not be the case.
> +        */
> 
> Would you mind explaining that a bit more for me?  I don't understand exactly
> what you mean here, but it sounds like the sort of thing that needs to be
> clarified/fixed before it can be committed.  Am I misunderstanding this?
>


The call right after that comment is

    items = build_sorted_items(numrows, rows, stats[0]->tupDesc,
                               mss, k, attnums_dep);

That function processes an array of tuples whose structure is defined
by a "tuple descriptor" (essentially a list of attribute info - data type,
length, ...). We get that from stats[0] and assume all the entries point
to the same tuple descriptor. That's generally a safe assumption, I think,
because all the stats entries relate to columns from the same table.
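
As a rough illustration of that assumption (a standalone sketch, not code
from the patch), a check that every stats entry points to the same
descriptor could look like the following. MockTupleDesc, MockStats and
stats_share_tupdesc are hypothetical stand-ins invented for this example;
in the backend the same condition would more likely be wrapped in an
Assert() on the real structures.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a tuple descriptor (attribute metadata). */
typedef struct MockTupleDesc
{
    int         natts;
} MockTupleDesc;

/* Hypothetical stand-in for a per-column stats entry. */
typedef struct MockStats
{
    MockTupleDesc *tupDesc;
} MockStats;

/*
 * Return true if all nstats entries point to the very same descriptor
 * as stats[0] (pointer equality, not structural equality).
 */
bool
stats_share_tupdesc(MockStats **stats, int nstats)
{
    for (int i = 1; i < nstats; i++)
    {
        if (stats[i]->tupDesc != stats[0]->tupDesc)
            return false;
    }
    return true;
}
```

If such a check ever failed, passing stats[0]->tupDesc for all columns
would no longer be justified.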

> 
> In src/backend/statistics/mcv.c, you have comments:
> 
> + * FIXME: Single-dimensional MCV is sorted by frequency (descending). We
> + * should do that too, because when walking through the list we want to
> + * check the most frequent items first.
> + *
> + * TODO: We're using Datum (8B), even for data types (e.g. int4 or float4).
> + * Maybe we could save some space here, but the bytea compression should
> + * handle it just fine.
> + *
> + * TODO: This probably should not use the ndistinct directly (as computed from
> + * the table, but rather estimate the number of distinct values in the
> + * table), no?
> 
> Do you intend these to be fixed/implemented prior to committing this patch?
> 

Actually, the first FIXME is obsolete, as build_distinct_groups returns
the groups sorted by frequency. I'll remove that.
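
For illustration only, ordering groups by descending frequency can be
sketched with a plain qsort comparator. This is a standalone example under
assumed names (MockGroup, compare_groups_by_count_desc and
sort_groups_by_frequency are hypothetical); it is not how
build_distinct_groups is actually written.

```c
#include <stdlib.h>

/* Hypothetical group record: only the count matters for ordering. */
typedef struct MockGroup
{
    int         value;
    int         count;
} MockGroup;

/* qsort comparator: groups with a higher count sort first. */
int
compare_groups_by_count_desc(const void *a, const void *b)
{
    const MockGroup *ga = (const MockGroup *) a;
    const MockGroup *gb = (const MockGroup *) b;

    if (ga->count > gb->count)
        return -1;
    if (ga->count < gb->count)
        return 1;
    return 0;
}

/* Sort the array in place, most frequent group first. */
void
sort_groups_by_frequency(MockGroup *groups, size_t ngroups)
{
    qsort(groups, ngroups, sizeof(MockGroup), compare_groups_by_count_desc);
}
```

With the list in this order, a walk through the items checks the most
frequent ones first, which is what the FIXME asked for.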

I think the rest is more a subject for discussion, so I'd need to hear
some feedback.

> 
> Further down in function statext_mcv_build, you have two loops, the first
> allocating memory and the second initializing the memory.  There is no clear
> reason why this must be done in two loops.  I tried combining the two loops
> into one, and it worked just fine, but did not look any cleaner to me.  Feel
> free to disregard this paragraph if you like it better the way you currently
> have it organized.
> 

I did it this way because of readability. I don't think this is a major
efficiency issue, as the maximum number of items is fairly limited, and
it happens only once at the end of the MCV list build (and the sorts and
comparisons are likely much more CPU expensive).
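
To show the shape being discussed, here is a minimal standalone sketch of
the two-pass pattern: one loop that only allocates, then one that only
initializes. MockItem, build_items and free_items are hypothetical names
for this example, and it uses malloc/calloc rather than the backend's
palloc, so it should not be read as the statext_mcv_build code.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical MCV-like item: per-dimension values and NULL flags. */
typedef struct MockItem
{
    int        *values;
    bool       *isnull;
} MockItem;

/* Release an array produced by build_items (NULL-safe). */
void
free_items(MockItem *items, int nitems)
{
    if (items == NULL)
        return;
    for (int i = 0; i < nitems; i++)
    {
        free(items[i].values);
        free(items[i].isnull);
    }
    free(items);
}

/* Build nitems items of ndims dimensions from a row-major src array. */
MockItem *
build_items(const int *src, int nitems, int ndims)
{
    /* calloc zeroes the pointers, so a partial build can be freed safely */
    MockItem   *items = calloc(nitems, sizeof(MockItem));

    if (items == NULL)
        return NULL;

    /* pass 1: allocate the per-item arrays */
    for (int i = 0; i < nitems; i++)
    {
        items[i].values = malloc(ndims * sizeof(int));
        items[i].isnull = malloc(ndims * sizeof(bool));
        if (items[i].values == NULL || items[i].isnull == NULL)
        {
            free_items(items, nitems);
            return NULL;
        }
    }

    /* pass 2: initialize the allocated memory */
    for (int i = 0; i < nitems; i++)
    {
        for (int j = 0; j < ndims; j++)
        {
            items[i].values[j] = src[i * ndims + j];
            items[i].isnull[j] = false;
        }
    }

    return items;
}
```

Both passes are linear in the number of items, so merging them would save
little next to the sorting done earlier in the build.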

> 
> Further down in statext_mcv_deserialize, you have some elogs which might need
> to be ereports.  It is unclear to me whether you consider these deserialize
> error cases to be "can't happen" type errors.  If so, you might add that fact
> to the comments rather than changing the elogs to ereports.
> 

I might be missing something, but why would ereport be more appropriate
than elog? Ultimately, there's not much difference between elog(ERROR)
and ereport(ERROR) - both will cause a failure.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
