On Wed, Jan 22, 2025 at 5:50 PM Tomas Vondra <to...@vondra.me> wrote:
> Hi,
>
> Thanks for continuing to work on this.
>
> On 1/22/25 19:17, Corey Huinker wrote:
> > This is a separate thread for work started in [1] but focused purely on
> > getting the following functions working:
> >
> > * pg_set_extended_stats
> > * pg_clear_extended_stats
> > * pg_restore_extended_stats
> >
> > These functions are analogous to their relation/attribute counterparts,
> > use the same calling conventions, and build upon the same basic
> > infrastructure.
> >
> > I think it is important that we get these implemented because they close
> > the gap that was left in terms of the ability to modify existing
> > statistics, and to round out the work being done to carry over statistics
> > via dump/restore and pg_upgrade in [1].
> >
> > The purpose of each patch is as follows (adapted from the previous thread):
> >
> > 0001 - This makes the input function for pg_ndistinct functional.
> >
> > 0002 - This makes the input function for pg_dependencies functional.
>
> I only quickly skimmed the patches, but a couple of comments:
>
> 1) I think it makes perfect sense to use JSON parsing for the input
> functions, but maybe it'd be better to adjust the format a bit to make
> that even easier?
>
> Right now the JSON "keys" have structure, which means we need some ad
> hoc parsing. Maybe we should make it "proper JSON" by moving that into
> separate key/value pairs, e.g. for ndistinct we might replace this:
>
>   {"1, 2": 2323, "1, 3": 3232, ...}
>
> with this:
>
>   [ {"keys": [1, 2], "ndistinct": 2323},
>     {"keys": [1, 3], "ndistinct": 3232},
>     ... ]
>
> so a regular JSON array of objects, with "keys" an array. And similarly
> for dependencies.

That is almost exactly what I did back when the stats import functions
took a nested JSON argument. The biggest problem with changing that
format is that the old format would still show up in the systems being
exported from, so we would have to process that format as well as the
new one.
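For illustration, the conversion between the two formats above is mechanical. This is a hypothetical Python sketch (Python just for brevity; the real conversion would live in pg_dump SQL or the C input function), assuming the current key strings are comma-separated attnums as in the example:

```python
import json

def ndistinct_old_to_new(old_text: str) -> list:
    """Convert the current pg_ndistinct output format, e.g.
      {"1, 2": 2323, "1, 3": 3232}
    into the proposed array-of-objects form:
      [{"keys": [1, 2], "ndistinct": 2323}, ...]
    """
    old = json.loads(old_text)
    return [
        # int() tolerates the space after each comma in the key string
        {"keys": [int(a) for a in key.split(",")], "ndistinct": n}
        for key, n in old.items()
    ]
```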
> Yes, it's more verbose, but maybe better for "mechanical" processing?

It absolutely would be better for processing, but we'd still have to
read the old format from older systems. I suppose the pg_dump code
could do some SQL gymnastics to convert the old json-but-sad format
into the processing-friendly format of the future, and I could easily
adapt what I wrote over a year ago to that task. It's really just a
matter of getting the community behind changing the output format to
enable a better input format.

> 2) Do we need some sort of validation? Perhaps this was discussed in the
> other thread and I missed it, but isn't it a problem that we happily
> accept e.g. this?
>
>   {"6666, 6666": 1, "1, -222": 14, ...}
>
> That has duplicate keys with bogus attribute numbers, stats on (bogus)
> system attributes, etc. I suspect this may easily cause problems during
> planning (if it happens to visit those statistics).

We used to have _lots_ of validation for data quality issues, much of
which was removed at the request of reviewers. However, much of that
discussion was about the health of the statistics as a whole; these are
standalone data types, so maybe they're held to a higher standard. If
so, what sort of checks do you think would be needed and/or wanted? So
far, I can imagine the following:

* no negative attnums in the key list
* no duplicate attnums in the key list

Anything else?

> Maybe that's acceptable - ultimately the user could import something
> broken in a much subtler way, of course. But pg_set_attribute_stats
> seems somewhat more protected against this, because it gets the attr as
> a separate argument.
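To make the proposed checks concrete, here is a hypothetical sketch (again Python for brevity, operating on the array-of-objects form; the real checks would go in the C input functions). It rejects negative attnums, duplicate attnums within a key list, and duplicate key lists across entries, which covers both problems in the {"6666, 6666": 1, "1, -222": 14} example:

```python
def validate_ndistinct_keys(items):
    """Raise ValueError for the data-quality problems discussed above:
    non-positive attnums (including system attributes), duplicate
    attnums within a key list, and duplicate key lists."""
    seen = set()
    for item in items:
        keys = item["keys"]
        if any(k <= 0 for k in keys):
            raise ValueError(f"invalid attnum in key list: {keys}")
        if len(set(keys)) != len(keys):
            raise ValueError(f"duplicate attnum in key list: {keys}")
        frozen = frozenset(keys)
        if frozen in seen:
            raise ValueError(f"duplicate key list: {keys}")
        seen.add(frozen)
```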
The datatype itself is validated in isolation, but once we've created a
valid pg_ndistinct or pg_dependencies, there's nothing stopping us from
validating the values in the datatype against the statistics object and
the relation it belongs to. That might get the same resistance I got
to, say, ensuring that frequency lists were monotonically decreasing.

> I recall I wished to have the attnum in the output function, but that
> was not quite possible because we don't know the relid (and thus the
> tuple descriptor) in that function.
>
> Is it a good idea to rely on the input/output format directly? How will
> that deal with cross-version differences? Doesn't it mean the in/out
> format is effectively fixed, or at least has to be backwards compatible
> (i.e. a new version has to handle any value from older versions)?

Presently there are no cross-version differences, though I addressed
the pros and cons of changing the format above. No matter what, the
burden of supplying a valid format is on the user in the case of
pg_set_extended_stats(). pg_restore_extended_stats() has a server
version number associated with it, so the option of handling a format
change could be baked in, but then we're doing version tests and input
typecasts like we do with ANYARRAY types. Not impossible, but
definitely more work.

> Or what if I want to import the stats for a table with a slightly
> different structure (e.g. because dump/restore skips dropped columns)?
> Won't that be a problem with the format containing raw attnums? Or is
> this a use case we don't expect to work?

The family of pg_set_*_stats functions expects the input to be
meaningful and correct for that relation on that server version. Any
attnum translation would have to be done by the user to adapt to the
new or changed relation. The family of pg_restore_*_stats functions is
designed to be forward compatible and to work across versions, but for
basically the same relation, or a relation of the same shape.
Basically, they're for pg_restore and pg_upgrade, so no changes in
attnums would be expected.

> For the per-attribute stats it's probably fine, because that's mostly
> just a collection of regular data types (scalar values or arrays of
> values, ...) and we're not modifying them except for maybe adding new
> fields. But extended stats seem more complex, so maybe it's different?

I had that working by attname matching way back in the early days, but
it would involve creating an intermediate format for translating the
attnums to attnames on export, and then re-translating them on the way
back in. I suppose someone could write the following utility functions:

    pg_xlat_ndistinct_to_attnames(relation reloid, ndist pg_ndistinct)
        -> json
    pg_xlat_ndistinct_from_attnames(relation reloid, ndist json)
        -> pg_ndistinct

and that would bridge the gap for the special case where you want to
adapt a pg_ndistinct from one table structure to a slightly different
one.

> I remember a long discussion about the format at the very beginning of
> this patch series, and the conclusion clearly was to have a function
> that imports stats for one attribute at a time. And that seems to be
> working fine, but extended stats values have more internal structure,
> so perhaps they need to do something more complicated.

I believe this *is* doing something more complicated, especially with
the MCVList, though I was very pleased to see how much of the existing
infrastructure I was able to reuse.

> > 0003 - Makes several static functions in attribute_stats.c public for
> > use by extended stats. One of those is get_stat_attr_type(), which in
> > the last patchset was modified to take an attribute name rather than an
> > attnum, thus saving a syscache lookup. However, extended stats
> > identifies attributes by attnum, not name, so that optimization had to
> > be set aside, at least temporarily.
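As a rough illustration of what the pg_xlat_* helpers proposed above
would do, here is a hypothetical Python sketch. The dict arguments
stand in for lookups against pg_attribute; the function names and
shapes are my own for illustration, not anything in the patches:

```python
def ndistinct_to_attnames(attnum_to_name: dict, entries: list) -> list:
    """Export direction: replace raw attnums with column names so the
    stats can be carried to a table with a slightly different shape."""
    return [
        {"keys": [attnum_to_name[k] for k in e["keys"]],
         "ndistinct": e["ndistinct"]}
        for e in entries
    ]

def ndistinct_from_attnames(name_to_attnum: dict, entries: list) -> list:
    """Import direction: map names back to the target table's attnums,
    which may differ from the source table's (e.g. dropped columns)."""
    return [
        {"keys": sorted(name_to_attnum[k] for k in e["keys"]),
         "ndistinct": e["ndistinct"]}
        for e in entries
    ]
```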
> > 0004 - These implement the functions pg_set_extended_stats(),
> > pg_clear_extended_stats(), and pg_restore_extended_stats(), which
> > behave like their relation/attribute equivalents. If we can get these
> > committed and used by pg_dump, then we don't have to debate how to
> > handle post-upgrade steps for users who happen to have extended stats
> > vs. the approximately 99.75% of users who do not.
>
> I see there's a couple of MCV-specific functions in extended_stats.c.
> Shouldn't those go into mcv.c instead?

I wanted to put them there, but there was a reason I didn't, and I've
now forgotten what it was. I'll make an effort to relocate them to
mcv.c. For that matter, it might make sense to break out the
expressions code into its own file, since every other type of extended
statistic has its own. Thoughts on that?

> FWIW there's a bunch of whitespace issues during git apply.

+1

> OK. Thanks for the patch!

Thanks for the feedback, please keep it coming. I think it's really
important that extended stats, though rarely used, be included in our
dump/restore/upgrade changes so as to make for a more consistent user
experience.