My apologies for having paid so little attention to this thread for months. I got around to reading the v15 patches today, and while I think they're going in more or less the right direction, there's a long way to go IMO.
I concur with the plan of extracting data from pg_stats not pg_statistics, and with emitting a single "set statistics" call per attribute. (I think at one point I'd suggested a call per stakind slot, but that would lead to a bunch of UPDATEs on existing pg_attribute tuples and hence a bunch of dead tuples at the end of an import, so it's not the way to go. A series of UPDATEs would likely also play poorly with a background auto-ANALYZE happening concurrently.) I do not like the current design for pg_set_attribute_stats' API though: I don't think it's at all future-proof. What happens when somebody adds a new stakind (and hence new pg_stats column)? You could try to add an overloaded pg_set_attribute_stats version with more parameters, but I'm pretty sure that would lead to "ambiguous function call" failures when trying to load old dump files containing only the original parameters. The present design is also fragile in that an unrecognized parameter will lead to a parse-time failure and no function call happening, which is less robust than I'd like. As lesser points, the relation argument ought to be declared regclass not oid for convenience of use, and I really think that we need to provide the source server's major version number --- maybe we will never need that, but if we do and we don't have it we will be sad. So this leads me to suggest that we'd be best off with a VARIADIC ANY signature, where the variadic part consists of alternating parameter labels and values: pg_set_attribute_stats(table regclass, attribute name, inherited bool, source_version int, variadic "any") returns void where a call might look like SELECT pg_set_attribute_stats('public.mytable', 'mycolumn', false, -- not inherited 16, -- source server major version -- pairs of labels and values follow 'null_frac', 0.4, 'avg_width', 42, 'histogram_bounds', array['a', 'b', 'c']::text[], ...); Note a couple of useful things here: * AFAICS we could label the function strict and remove all those ad-hoc null checks. If you don't have a value for a particular stat, you just leave that pair of arguments out (exactly as the existing code in 0002 does, just using a different notation). This also means that we don't need any default arguments and so no need for hackery in system_functions.sql. * If we don't recognize a parameter label at runtime, we can treat that as a warning rather than a hard error, and press on. This case would mostly be useful in major version downgrades I suppose, but that will be something people will want eventually. * We can require the calling statement to cast arguments, particularly arrays, to the proper type, removing the need for conversions within the stats-setting function. (But instead, it'd need to check that the next "any" argument is the type it ought to be based on the type of the target column.) If we write the labels as undecorated string literals as I show above, I think they will arrive at the function as "unknown"-type constants, which is a little weird but doesn't seem like it's really a big problem. The alternative is to cast them all to text explicitly, but that's adding notation to no great benefit IMO. pg_set_relation_stats is simpler in that the set of stats values to be set will probably remain fairly static, and there seems little reason to allow only part of them to be supplied (so personally I'd drop the business about accepting nulls there too). If we do grow another value or values for it to set there shouldn't be much problem with overloading it with another version with more arguments. Still needs to take regclass not oid though ... I've not read the patches in great detail, but I did note a few low-level issues: * why is check_relation_permissions looking up the pg_class row? There's already a copy of that in the Relation struct. Likewise for the other caller of can_modify_relation (but why is that caller not using check_relation_permissions?) That all looks overly complicated and duplicative. I think you don't need two layers of function there. * I find the stuff with enums and "const char *param_names" to be way too cute and unlike anything we do elsewhere. Please don't invent your own notations for coding patterns that have hundreds of existing instances. pg_set_relation_stats, for example, has absolutely no reason not to look like the usual Oid relid = PG_GETARG_OID(0); float4 relpages = PG_GETARG_FLOAT4(1); ... etc ... * The array manipulations seem to me to be mostly not well chosen. There's no reason to use expanded arrays here, since you won't be modifying the arrays in-place; all that's doing is wasting memory. I'm also noting a lack of defenses against nulls in the arrays. I'd suggest using deconstruct_array to disassemble the arrays, if indeed they need disassembled at all. (Maybe they don't, see next item.) * I'm dubious that we can fully vet the contents of these arrays, and even a little dubious that we need to try. As an example, what's the worst that's going to happen if a histogram array isn't sorted precisely? You might get bogus selectivity estimates from the planner, but that's no worse than you would've got with no stats at all. (It used to be that selfuncs.c would use a histogram even if its contents didn't match the query's collation. The comments justifying that seem to be gone, but I think it's still the case that the code isn't *really* dependent on the sort order being exactly so.) The amount of hastily-written code in the patch for checking this seems a bit scary, and it's well within the realm of possibility that it introduces more bugs than it prevents. We do need to verify data types, lack of nulls, and maybe 1-dimensional-ness, which could break the accessing code at a fairly low level; but I'm not sure that we need more than that. * There's a lot of ERROR cases that maybe we ought to downgrade to WARN-and-press-on, in the service of not breaking the restore completely in case of trouble. * 0002 is confused about whether the tag for these new TOC entries is "STATISTICS" or "STATISTICS DATA". I also think they need to be in SECTION_DATA not SECTION_NONE, and I'd be inclined to make them dependent on the table data objects not the table declarations. We don't really want a parallel restore to load them before the data is loaded: that just increases the risk of bad interactions with concurrent auto-analyze. * It'd definitely not be OK to put BEGIN/COMMIT into the commands in these TOC entries. But I don't think we need to. * dumpRelationStats seems to be dumping the relation-level stats twice. * Why exactly are you suppressing testing of statistics upgrade in 002_pg_upgrade?? regards, tom lane