Hi,

On 3/17/19 12:47 PM, Dean Rasheed wrote:
> On Sat, 16 Mar 2019 at 23:44, Tomas Vondra <tomas.von...@2ndquadrant.com>
> wrote:
>>
>>> 28). I just spotted the 1MB limit on the serialised MCV list size. I
>>> think this is going to be too limiting. For example, if the stats
>>> target is at its maximum of 10000, that only leaves around 100 bytes
>>> for each item's values, which is easily exceeded. In fact, I think
>>> this approach for limiting the MCV list size isn't a good one --
>>> consider what would happen if there were lots of very large values.
>>> Would it run out of memory before getting to that test? Even if not,
>>> it would likely take an excessive amount of time.
>>>
>>
>> True. I don't have a very good argument for a specific value, or even
>> having an explicit limit at all. I initially added it mostly as a
>> safety for development purposes, but I think you're right we can just
>> get rid of it. I don't think it'd run out of memory before hitting the
>> limit, but I haven't tried very hard (though I recall running into the
>> 1MB limit in the past).
>>
>
> I've just been playing around a little with this and found that it
> isn't safely dealing with toasted values. For example, consider the
> following test:
>
> create or replace function random_string(x int) returns text
> as $$
>   select substr(string_agg(md5(random()::text), ''), 1, x)
>   from generate_series(1, (x+31)/32);
> $$ language sql;
>
> drop table if exists t;
> create table t(a int, b text);
> insert into t values (1, random_string(10000000));
> create statistics s (mcv) on a, b from t;
> analyse t;
>
> select length(b), left(b,5), right(b,5) from t;
>
> select length(stxmcv), length((m.values::text[])[2]),
>        left((m.values::text[])[2], 5), right((m.values::text[])[2], 5)
>   from pg_statistic_ext, pg_mcv_list_items(stxmcv) m
>  where stxrelid = 't'::regclass;
>
> The final query returns the following:
>
>  length |  length  | left  | right
> --------+----------+-------+-------
>     250 | 10000000 | c2667 | 71492
> (1 row)
>
> suggesting that there's something odd about the stxmcv value. Note,
> also, that it doesn't hit the 1MB limit, even though the value is much
> bigger than that.
>
> If I then delete the value from the table, without changing the stats,
> and repeat the final query, it falls over:
>
> delete from t where a = 1;
>
> select length(stxmcv), length((m.values::text[])[2]),
>        left((m.values::text[])[2], 5), right((m.values::text[])[2], 5)
>   from pg_statistic_ext, pg_mcv_list_items(stxmcv) m
>  where stxrelid = 't'::regclass;
>
> ERROR:  unexpected chunk number 5008 (expected 0) for toast value
> 16486 in pg_toast_16480
>
> So I suspect it was using the toast data from the table t, although
> I've not tried to investigate further.
>
Yes, it was using the toasted value directly. The attached patch detoasts
the value explicitly, similarly to the per-column stats, and it also
removes the 1MB limit.

>
>>> I think this part of the patch needs a bit of a rethink. My first
>>> thought is to do something similar to what happens for per-column
>>> MCVs, and set an upper limit on the size of each value that is ever
>>> considered for inclusion in the stats (c.f. WIDTH_THRESHOLD and
>>> toowide_cnt in analyze.c). Over-wide values should be excluded early
>>> on, and it will need to track whether or not any such values were
>>> excluded, because then it wouldn't be appropriate to treat the stats
>>> as complete and keep the entire list without calling
>>> get_mincount_for_mcv_list().
>>>
>>
>> Which part? Serialization / deserialization? Or how we handle long
>> values when building the MCV list?
>>
>
> I was thinking (roughly) of something like the following:
>
> * When building the values array for the MCV list, strip out rows with
>   values wider than some threshold (probably something like
>   WIDTH_THRESHOLD = 1024 from analyze.c would be reasonable).
>
> * When building the MCV list, if some over-wide values were previously
>   stripped out, always go into the get_mincount_for_mcv_list() block,
>   even if nitems == ngroups (for the same reason a similar thing happens
>   for per-column stats -- if some items were stripped out, we're already
>   saying that not all items will go in the MCV list, and it's not safe
>   to assume that the remaining items are common enough to give accurate
>   estimates).
>

Yes, that makes sense, I guess.

> * In the serialisation code, remove the size limit entirely. We know
>   that each value is now at most 1024 bytes, and there are at most 10000
>   items and at most 8 columns, so the total size is already reasonably
>   well bounded. In the worst case it might be around 80MB, but in
>   practice it's always likely to be much, much smaller than that.
>

Yep, I've already removed the limit from the current patch.


cheers

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
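
For reference, the per-column handling in analyze.c that both the
detoasting fix and the WIDTH_THRESHOLD suggestion point at looks roughly
like the sketch below. This is only an illustration of that pattern, not
code from the attached patches; accept_mcv_value and
STATS_WIDTH_THRESHOLD are placeholder names.

#include "postgres.h"
#include "fmgr.h"                  /* PG_DETOAST_DATUM */
#include "access/tuptoaster.h"     /* toast_raw_datum_size
                                    * (access/detoast.h in later releases) */

/* same spirit as WIDTH_THRESHOLD in analyze.c */
#define STATS_WIDTH_THRESHOLD   1024

/*
 * Decide whether a sampled value may become an MCV candidate.  Over-wide
 * varlena values are rejected (and counted via *toowide_cnt, so the
 * resulting list can be treated as incomplete); accepted varlena values
 * are detoasted once, so the stored statistics never keep a reference
 * into the source table's TOAST data.
 */
static bool
accept_mcv_value(Datum value, bool is_varlena,
                 Datum *outvalue, int *toowide_cnt)
{
    if (!is_varlena)
    {
        /* fixed-width types are never toasted or over-wide */
        *outvalue = value;
        return true;
    }

    /* Exclude excessively wide values, but remember that we did so. */
    if (toast_raw_datum_size(value) > STATS_WIDTH_THRESHOLD)
    {
        (*toowide_cnt)++;
        return false;
    }

    /*
     * Detoast up front, so later comparisons and serialization see a
     * plain in-memory datum.
     */
    *outvalue = PointerGetDatum(PG_DETOAST_DATUM(value));
    return true;
}

With toowide_cnt > 0, the build step would then always apply the
get_mincount_for_mcv_list() cutoff instead of keeping every group, as the
second bullet above describes.
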
0001-retain-reltuples-relpages-on-ALTER-TABLE-20190317b.patch.gz
0002-multivariate-MCV-lists-20190317b.patch.gz
0003-multivariate-histograms-20190317b.patch.gz