Hi,

On 3/17/19 12:47 PM, Dean Rasheed wrote:
> On Sat, 16 Mar 2019 at 23:44, Tomas Vondra <tomas.von...@2ndquadrant.com>
> wrote:
>>
>>> 28). I just spotted the 1MB limit on the serialised MCV list size. I
>>> think this is going to be too limiting. For example, if the stats
>>> target is at its maximum of 10000, that only leaves around 100 bytes
>>> for each item's values, which is easily exceeded. In fact, I think
>>> this approach for limiting the MCV list size isn't a good one --
>>> consider what would happen if there were lots of very large values.
>>> Would it run out of memory before getting to that test? Even if not,
>>> it would likely take an excessive amount of time.
>>>
>>
>> True. I don't have a very good argument for a specific value, or even
>> having an explicit limit at all. I initially added it mostly as a
>> safety for development purposes, but I think you're right we can just
>> get rid of it. I don't think it'd run out of memory before hitting the
>> limit, but I haven't tried very hard (though I recall running into the
>> 1MB limit in the past).
>>
>
> I've just been playing around a little with this and found that it
> isn't safely dealing with toasted values. For example, consider the
> following test:
>
> create or replace function random_string(x int) returns text
> as $$
>   select substr(string_agg(md5(random()::text), ''), 1, x)
>   from generate_series(1, (x+31)/32);
> $$ language sql;
>
> drop table if exists t;
> create table t(a int, b text);
> insert into t values (1, random_string(10000000));
> create statistics s (mcv) on a, b from t;
> analyse t;
>
> select length(b), left(b,5), right(b,5) from t;
>
> select length(stxmcv), length((m.values::text[])[2]),
>        left((m.values::text[])[2], 5), right((m.values::text[])[2], 5)
>   from pg_statistic_ext, pg_mcv_list_items(stxmcv) m
>  where stxrelid = 't'::regclass;
>
> The final query returns the following:
>
>  length |  length  | left  | right
> --------+----------+-------+-------
>     250 | 10000000 | c2667 | 71492
> (1 row)
>
> suggesting that there's something odd about the stxmcv value. Note,
> also, that it doesn't hit the 1MB limit, even though the value is much
> bigger than that.
>
> If I then delete the value from the table, without changing the stats,
> and repeat the final query, it falls over:
>
> delete from t where a = 1;
>
> select length(stxmcv), length((m.values::text[])[2]),
>        left((m.values::text[])[2], 5), right((m.values::text[])[2], 5)
>   from pg_statistic_ext, pg_mcv_list_items(stxmcv) m
>  where stxrelid = 't'::regclass;
>
> ERROR:  unexpected chunk number 5008 (expected 0) for toast value
> 16486 in pg_toast_16480
>
> So I suspect it was using the toast data from the table t, although
> I've not tried to investigate further.
>
Yes, it was using the toasted value directly. The attached patch detoasts
the value explicitly, similarly to the per-column stats, and it also
removes the 1MB limit.

>
>>> I think this part of the patch needs a bit of a rethink. My first
>>> thought is to do something similar to what happens for per-column
>>> MCVs, and set an upper limit on the size of each value that is ever
>>> considered for inclusion in the stats (c.f. WIDTH_THRESHOLD and
>>> toowide_cnt in analyze.c). Over-wide values should be excluded early
>>> on, and it will need to track whether or not any such values were
>>> excluded, because then it wouldn't be appropriate to treat the stats
>>> as complete and keep the entire list without calling
>>> get_mincount_for_mcv_list().
>>>
>>
>> Which part? Serialization / deserialization? Or how we handle long
>> values when building the MCV list?
>>
>
> I was thinking (roughly) of something like the following:
>
> * When building the values array for the MCV list, strip out rows with
>   values wider than some threshold (probably something like
>   WIDTH_THRESHOLD = 1024 from analyze.c would be reasonable).
>
> * When building the MCV list, if some over-wide values were previously
>   stripped out, always go into the get_mincount_for_mcv_list() block,
>   even if nitems == ngroups (for the same reason a similar thing happens
>   for per-column stats -- if some items were stripped out, we're already
>   saying that not all items will go in the MCV list, and it's not safe
>   to assume that the remaining items are common enough to give accurate
>   estimates).
>

Yes, that makes sense, I guess.

> * In the serialisation code, remove the size limit entirely. We know
>   that each value is now at most 1024 bytes, and there are at most 10000
>   items and at most 8 columns, so the total size is already reasonably
>   well bounded. In the worst case it might be around 80MB, but in
>   practice it's always likely to be much, much smaller than that.
>

Yep, I've already removed the limit from the current patch.


cheers

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
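
For reference, the per-column handling in analyze.c that both the
detoasting fix and the WIDTH_THRESHOLD suggestion point at looks roughly
like the sketch below. This is only an illustration of that pattern, not
code from the attached patches; accept_mcv_value and
STATS_WIDTH_THRESHOLD are placeholder names.

#include "postgres.h"
#include "fmgr.h"                  /* PG_DETOAST_DATUM */
#include "access/tuptoaster.h"     /* toast_raw_datum_size
                                    * (access/detoast.h in later releases) */

/* same spirit as WIDTH_THRESHOLD in analyze.c */
#define STATS_WIDTH_THRESHOLD   1024

/*
 * Decide whether a sampled value may become an MCV candidate.  Over-wide
 * varlena values are rejected (and counted via *toowide_cnt, so the
 * resulting list can be treated as incomplete); accepted varlena values
 * are detoasted once, so the stored statistics never keep a reference
 * into the source table's TOAST data.
 */
static bool
accept_mcv_value(Datum value, bool is_varlena,
                 Datum *outvalue, int *toowide_cnt)
{
    if (!is_varlena)
    {
        /* fixed-width types are never toasted or over-wide */
        *outvalue = value;
        return true;
    }

    /* Exclude excessively wide values, but remember that we did so. */
    if (toast_raw_datum_size(value) > STATS_WIDTH_THRESHOLD)
    {
        (*toowide_cnt)++;
        return false;
    }

    /*
     * Detoast up front, so later comparisons and serialization see a
     * plain in-memory datum.
     */
    *outvalue = PointerGetDatum(PG_DETOAST_DATUM(value));
    return true;
}

With toowide_cnt > 0, the build step would then always apply the
get_mincount_for_mcv_list() cutoff instead of keeping every group, as the
second bullet above describes.
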
0001-retain-reltuples-relpages-on-ALTER-TABLE-20190317b.patch.gz
0002-multivariate-MCV-lists-20190317b.patch.gz
0003-multivariate-histograms-20190317b.patch.gz