On Tue, 15 Jan 2019 at 08:21, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote: > Turns out you were right - the attribute_referenced piece was quite > unnecessary. So I've removed it. I've also extended the regression tests > to verify changing type of another column does not reset the stats.
(Trying to find my feet over here) I've read over the entire thread, and apart from missing the last two emails and therefore the latest patch, I managed to read over most of the MCV patch. I didn't quite get to reading mcv.c and don't quite have the energy to take that on now. At this stage I'm trying to get to know the patch. I read a lot of discussing between you and Dean ironing out how the stats should be used to form selectivities. At the time I'd not read the patch yet, so most of it went over my head. I did note down a few things on my read. I've included them below. Hopefully, they're useful. MCV list review 1. In mvc.c there's Assert(ndistinct <= UINT16_MAX); This should be PG_UINT16_MAX 2. math.h should be included just after postgres.h 3. Copyright is still -2017 in mcv.c. Hopefully, if you change it to 2019, you'll never have to bump it ever again! :-) 4. Looking at pg_stats_ext_mcvlist_items() I see you've coded the string building manually. The way it's coded I'm finding a little strange. It means the copying becomes quadratic due to snprintf(buff, 1024, format, values[1], DatumGetPointer(valout)); strncpy(values[1], buff, 1023); So basically, generally, here you're building a new string with values[1] followed by a comma, then followed by valout. One the next line you then copy that new buffer back into values[1]. I understand this part is likely not performance critical, but I see no reason to write the code this way. Are you limiting the strings to 1024 bytes on purpose? I don't see any comment mentioning you want to truncate strings. Would it not be better to do this part using a AppendStringInfoString()? and just manually add a '{', ',' or '}' as and when required? DatumGetPointer(valout) should really be using DatumGetCString(valout). Likely you can also use heap_form_tuple. This will save you having to convert ints into strings then only to have BuildTupleFromCStrings() do the reverse. 5. individiaul -> individual lists. This allows very accurate estimates for individiaul columns, but litst -> lists litst on combinations of columns. Similarly to functional dependencies 6. Worth mentioning planning cycles too? "It's advisable to create <literal>MCV</literal> statistics objects only on combinations of columns that are actually used in conditions together, and for which misestimation of the number of groups is resulting in bad plans. Otherwise, the <command>ANALYZE</command> cycles are just wasted." 7. straight-forward -> straightforward (most-common values) lists, a straight-forward extension of the per-column 8. adresses -> addresses statistics adresses the limitation by storing individual values, but it 9. Worth mentioning ANALYZE time? This section introduces multivariate variant of <acronym>MCV</acronym> (most-common values) lists, a straight-forward extension of the per-column statistics described in <xref linkend="row-estimation-examples"/>. This statistics adresses the limitation by storing individual values, but it is naturally more expensive, both in terms of storage and planning time. 10. low -> a low with low number of distinct values. Before looking at the second query, 11. them -> then on items in the <acronym>MCV</acronym> list, and them sums the frequencies 12. Should we be referencing the source from the docs? See <function>mcv_clauselist_selectivity</function> in <filename>src/backend/statistics/mcv.c</filename> for details. hmm. I see it's not the first going by: git grep -E "\w+\.c\<" 13. Pretty minor, but the following loop in UpdateStatisticsForTypeChange() could use a break; attribute_referenced = false; for (i = 0; i < staForm->stxkeys.dim1; i++) if (attnum == staForm->stxkeys.values[i]) attribute_referenced = true; UPDATE: If I'd reviewed the correct patch I'd have seen that you'd removed this already 14. Again in UpdateStatisticsForTypeChange(), would it not be better to do the statext_is_kind_built(oldtup, STATS_EXT_MCV) check before checking if the stats contain this column? This gets rid of your reset_stats variable. I also don't quite understand why there's an additional check for statext_is_kind_built(oldtup, STATS_EXT_MCV), which if that's false then why do we do the dummy update on the tuple? Have you just coded this so that you can support other stats types later without too much modification? If so, I'm finding it a bit confusing to read, so maybe it's worth only coding it that way if there's more than one stats type to reset for. UPDATE: If I'd reviewed the correct patch I'd have seen that you'd removed this already 15. I see you broke out the remainder of the code from clauselist_selectivity() into clauselist_selectivity_simple(). The comment looks like just a copy and paste from the original. That seems like quite a bit of duplication. Is it better to maybe trim down the original one? 16. I initially didn't see how this code transformed the bms into an array: /* * Transform the bms into an array, to make accessing i-th member easier, * and then construct a filtered version with only attnums referenced * by the dependency we validate. */ attnums = build_attnums(attrs); attnums_dep = (int *)palloc(k * sizeof(int)); for (i = 0; i < k; i++) attnums_dep[i] = attnums[dependency[i]]; Would it be better to name build_attnums() build_attnums_array() ? I think it would also be better to, instead of saying "the bms", just say "attrs". 17. dependencies_clauselist_selectivity(), in: if ((dependency_is_compatible_clause(clause, rel->relid, &attnum)) && (!bms_is_member(listidx, *estimatedclauses))) would it be better to have the bms_is_member() first? 18. In dependencies_clauselist_selectivity() there seem to be a new bug introduced. We do: /* mark this one as done, so we don't touch it again. */ *estimatedclauses = bms_add_member(*estimatedclauses, listidx); but the bms_is_member() check that skipped these has been removed. It might be easier to document if we just always do: if (bms_is_member(listidx, *estimatedclauses)) continue; at the start of both loops. list_attnums can just be left unset for the originally already estimatedclauses. 19. in extended_stats.c, should build_attnums() be documented that the Bitmapset members are not offset by FirstLowInvalidHeapAttributeNumber. I think mostly Bitmapsets of Attnums are offset by this, so might be worth a mention. 20. I think bms_member_index() needs documentation. I imagine you'll want to mention that the bitmapset must contain the given varattno, else surely it'll do the wrong thing if it's not. Perhaps an Assert(bms_is_member(keys, varattno)); should be added to it. 21. Comment does not really explain what the function does or what the arguments mean: /* * statext_is_compatible_clause_internal * Does the heavy lifting of actually inspecting the clauses for * statext_is_compatible_clause. */ 22. In statext_is_compatible_clause_internal(): /* Var = Const */ The above comment seems a bit misplaced. It looks like the code below it is looking for an OpExpr in the form of "Var <op> Const", or "Const <op> Var". 23. statext_is_compatible_clause_internal() you have: if ((get_oprrest(expr->opno) != F_EQSEL) && (get_oprrest(expr->opno) != F_NEQSEL) && (get_oprrest(expr->opno) != F_SCALARLTSEL) && (get_oprrest(expr->opno) != F_SCALARLESEL) && (get_oprrest(expr->opno) != F_SCALARGTSEL) && (get_oprrest(expr->opno) != F_SCALARGESEL)) return false; 6 calls to get_oprrest(). 1 is enough. How does the existing MCV and histogram stats handle these operators? Does it insist on a btree opfamily, or is it as crude as this too? 24. In statext_is_compatible_clause_internal, you have: /* NOT/AND/OR clause */ if (or_clause(clause) || and_clause(clause) || not_clause(clause)) { /* * AND/OR/NOT-clauses are supported if all sub-clauses are supported Looks like you were not sure which order to have these, so you just tried a few variations :-D Maybe just make them all the same? 25. Does statext_is_compatible_clause_internal)_ need to skip over RelabelTypes? 26. In statext_is_compatible_clause_internal() you mention: /* We only support plain Vars for now */, but I see nothing that ensures that only Vars are allowed in the is_opclause() condition. /* see if it actually has the right */ ok = (NumRelids((Node *) expr) == 1) && (is_pseudo_constant_clause(lsecond(expr->args)) || (varonleft = false, is_pseudo_constant_clause(linitial(expr->args)))); the above would allow var+var == const through. The NumRelids seems like it would never have anything > 1 as you have a BMS_SINGLETON test on the RestrictInfo where you're calling this function from. I think you likely want just a IsA(... , Var) checks here, after skipping over RelabelTypes. Not sure what "/* see if it actually has the right */" means. 27. Should the function be named something more related to MCV? The name makes it appear fairly generic to extended stats. * statext_is_compatible_clause * Determines if the clause is compatible with MCV lists. 28. This comment seems wrong: * Currently we only support Var = Const, or Const = Var. It may be possible * to expand on this later. I see you're allowing IS NULL and IS NOT NULL too. = does not seem to be required either. 29. The following fragment makes me think we're only processing clauses to use them with MCV lists, but the comment claims "dependency selectivity estimations" /* we're interested in MCV lists */ int types = STATS_EXT_MCV; /* check if there's any stats that might be useful for us. */ if (!has_stats_of_kind(rel->statlist, types)) return (Selectivity) 1.0; list_attnums = (Bitmapset **) palloc(sizeof(Bitmapset *) * list_length(clauses)); /* * Pre-process the clauses list to extract the attnums seen in each item. * We need to determine if there's any clauses which will be useful for * dependency selectivity estimations. Along the way we'll record all of 30. Is it better to do the bms_is_member() first here? if ((statext_is_compatible_clause(clause, rel->relid, &attnums)) && (!bms_is_member(listidx, *estimatedclauses))) Likely it'll be cheaper. 31. I think this comment should be /* Ensure choose_best_statistics() didn't mess up */ /* We only understand MCV lists for now. */ Assert(stat->kind == STATS_EXT_MCV); 32. What're lags? bool *isnull; /* lags of NULL values (up to 32 columns) */ 33. "ndimentions"? There's no field in the struct by that name. I'd assume it's the same size as the isnull array above it? Datum *values; /* variable-length (ndimensions) */ 34. README.mcv * large -> a large For columns with large number of distinct values (e.g. those with continuous * Is the following up-to-date? I thought I saw code for NOT too? (a) equality clauses WHERE (a = 1) AND (b = 2) (b) inequality clauses WHERE (a < 1) AND (b >= 2) (c) NULL clauses WHERE (a IS NULL) AND (b IS NOT NULL) (d) OR clauses WHERE (a < 1) OR (b >= 2) * multi-variate -> multivariate are large the list may be quite large. This is especially true for multi-variate * a -> an TODO Currently there's no logic to consider building only a MCV list (and not * I'd have said "an SRF", but git grep "a SRF" disagrees with me. I guess those people must be pronouncing it, somehow!? surf... serf... ? easier, there's a SRF returning detailed information about the MCV lists. * Is it better to put a working SQL in here? SELECT * FROM pg_mcv_list_items(stxmcv); maybe like: SELECT s.* FROM pg_statistic_ext, LATERAL pg_mcv_list_items(stxmcv) s; Maybe with a WHERE clause? * This list seems outdated. - item index (0, ..., (nitems-1)) - values (string array) - nulls only (boolean array) - frequency (double precision) base_frequency seems to exist now too. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services