On Tue, Apr 01, 2025 at 03:05:59PM -0700, Jeff Davis wrote:
> To restate the problem: one of the problems being solved here is that
> the existing code for custom-format dumps calls WriteToc twice. That
> was not a big problem before this patch, when the contents of the
> entries was easily accessible in memory. But the point of 0002 is to
> avoid keeping all of the stats in memory at once, because that causes
> bloat; and instead to query it on demand.
> 
> In theory, we could fix the pre-existing code by making the second pass
> able to jump over the other contents of the entry and just update the
> data offsets. But that seems invasive, at least to do it properly.
> 
> 0001 sidesteps the problem by skipping the second pass if data's not
> being dumped (because there are no offsets that need updating). The
> worst case is when there are a lot of objects with a small amount of
> data. But that's a worst case for stats in general, so I don't think
> that needs to be solved here.
> 
> Issuing the stats queries twice is not great, though. If there's any
> non-deterministic output in the query, that could lead to strangeness.
> How bad can that be? If the results change in some way that looks
> benign, but changes the length of the definition string, can it lead to
> corruption of a ToC entry? I'm not saying there's a problem, but trying
> to understand the risk of future problems.
It certainly feels risky.  I was able to avoid executing the queries twice
in all cases by saving the definition length in the TOC entry and skipping
that many bytes the second time round.  That's simple enough, but it relies
on various assumptions, such as fseeko() being available (IIUC the file
will only be open for writing, so we cannot fall back on fread()) and
WriteStr() returning an accurate value (which I'm skeptical of because
some formats compress this data).  But AFAICT custom format is the only
format that does a second WriteToc() pass at the moment, and it only does
so when fseeko() is usable.  Plus, custom format doesn't appear to
compress anything written via WriteStr().

We might be able to improve this by inventing a new callback that fails
for all formats except custom format with fseeko() available.  That would
at least ensure hard failures if these assumptions change.  That probably
wouldn't be terribly invasive.  I'm curious what you think.

> For 0003, it makes an assumption about the way the scan happens in
> WriteToc(). Can you add some additional sanity checks to verify that
> something doesn't happen in a different order than we expect?

Hm.  One thing we could do is to send the TocEntry to the callback and
verify that it matches the one we were expecting to see next (as set by a
previous call).  Does that sound like a strong enough check?  (A rough
sketch of what I mean is below.)  FWIW the pg_dump tests failed miserably
until Corey and I got this part right, so our usual tests should also
offer some assurance.

> Also, why do we need the clause "WHERE s.tablename = ANY($2)"? Isn't
> that already implied by "JOIN unnest($1, $2) ... s.tablename =
> u.tablename"?

Good question.  Corey, do you recall why this was needed?

-- 
nathan
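To make that concrete, here is a rough, untested sketch of the kind of
check described above.  The extra TocEntry argument and the "expectedTe"
variable are placeholders for illustration only, not necessarily what the
next version of the patch will look like:

    /* Hypothetical: pass the TocEntry through to the defnDumper. */
    typedef char *(*DefnDumperPtr) (Archive *AH, const void *userArg,
                                    const TocEntry *te);

    /* Set by fetchAttributeStats() to the entry it expects to serve next. */
    static TocEntry *expectedTe = NULL;

    static char *
    dumpRelationStats_dumper(Archive *fout, const void *userArg,
                             const TocEntry *te)
    {
        /*
         * If the archiver asks for an entry other than the one our TOC scan
         * predicted, the ordering assumption is broken; fail hard rather
         * than emit statistics for the wrong relation.
         */
        if (expectedTe != NULL && te != expectedTe)
            pg_fatal("unexpected TOC entry for statistics data: \"%s\"",
                     te->tag);

        /* ... generate the statistics commands as before ... */
    }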
>From 0b20e8a4c8f153e9f292da82a9dbc82ed3adbb3e Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nat...@postgresql.org>
Date: Mon, 31 Mar 2025 10:44:26 -0500
Subject: [PATCH v12n4 1/3] Skip second WriteToc() for custom-format dumps without data.

Presently, "pg_dump --format=custom" calls WriteToc() twice.  The second
call is intended to update the data offset information, which allegedly
makes parallel pg_restore significantly faster.  However, if we aren't
dumping any data, this step accomplishes nothing and can be skipped.
This is a preparatory optimization for follow-up commits that will move
the queries for attribute statistics to WriteToc() to save memory.

Reviewed-by: Jeff Davis <pg...@j-davis.com>
Discussion: https://postgr.es/m/Z9c1rbzZegYQTOQE%40nathan
---
 src/bin/pg_dump/pg_backup_custom.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_dump/pg_backup_custom.c b/src/bin/pg_dump/pg_backup_custom.c
index e44b887eb29..b971e3bc16e 100644
--- a/src/bin/pg_dump/pg_backup_custom.c
+++ b/src/bin/pg_dump/pg_backup_custom.c
@@ -755,9 +755,11 @@ _CloseArchive(ArchiveHandle *AH)
 	 * If possible, re-write the TOC in order to update the data offset
 	 * information.  This is not essential, as pg_restore can cope in most
 	 * cases without it; but it can make pg_restore significantly faster
-	 * in some situations (especially parallel restore).
+	 * in some situations (especially parallel restore).  We can skip this
+	 * step if we're not dumping any data.
 	 */
-	if (ctx->hasSeek &&
+	if (AH->public.dopt->dumpData &&
+	    ctx->hasSeek &&
 	    fseeko(AH->FH, tpos, SEEK_SET) == 0)
 		WriteToc(AH);
 }
-- 
2.39.5 (Apple Git-154)
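For context on why the second WriteToc() pass exists at all: the data
offsets for custom-format entries aren't known until the data has actually
been written, so the first pass writes placeholder offsets and the second
pass seeks back and patches in the real values.  A minimal standalone
illustration of that general pattern (plain C for illustration only;
pg_dump's real implementation goes through its own I/O wrappers and the
ctx->hasSeek check):

    #include <stdint.h>
    #include <stdio.h>

    int
    main(void)
    {
        FILE       *fh = fopen("demo.dat", "wb");
        long        slot;
        uint64_t    placeholder = 0;
        uint64_t    offset;

        if (fh == NULL)
            return 1;

        /* First pass: remember where the offset field lives, write a dummy. */
        slot = ftell(fh);
        fwrite(&placeholder, sizeof(placeholder), 1, fh);

        /* Write the data whose position we want to record. */
        offset = (uint64_t) ftell(fh);
        fputs("...table data goes here...", fh);

        /* Second pass: seek back and overwrite the placeholder. */
        if (fseek(fh, slot, SEEK_SET) == 0)
            fwrite(&offset, sizeof(offset), 1, fh);

        fclose(fh);
        return 0;
    }

If there is no data to dump, there are no offsets to patch, which is why
0001 can skip the second pass entirely.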
>From 6e6d5345d45092bc0acc5ca31c7d7d663fe2ccad Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nat...@postgresql.org>
Date: Tue, 1 Apr 2025 20:46:24 -0500
Subject: [PATCH v12n4 2/3] pg_dump: Reduce memory usage of dumps with statistics.

Right now, pg_dump stores all generated commands for statistics in
memory.  These commands can be quite large and therefore can
significantly increase pg_dump's memory footprint.  To fix, wait until
we are about to write out the commands before generating them, and be
sure to free the commands after writing.  This is implemented via a new
defnDumper callback that works much like the dataDumper one but is
specially designed for TOC entries.

Custom dumps that include data might write the TOC twice (to update data
offset information), which would ordinarily cause pg_dump to run the
attribute statistics queries twice.  However, as a hack, we save the
length of the written-out entry in the first pass, and we skip over it in
the second.  While there is no known technical problem with executing the
queries multiple times and rewriting the results, it's expensive and
feels risky, so it seems prudent to avoid it.

Author: Corey Huinker <corey.huin...@gmail.com>
Reviewed-by: Jeff Davis <pg...@j-davis.com>
Discussion: https://postgr.es/m/CADkLM%3Dc%2Br05srPy9w%2B-%2BnbmLEo15dKXYQ03Q_xyK%2BriJerigLQ%40mail.gmail.com
---
 src/bin/pg_dump/pg_backup.h          |  1 +
 src/bin/pg_dump/pg_backup_archiver.c | 44 ++++++++++++++++++++++++++--
 src/bin/pg_dump/pg_backup_archiver.h |  6 ++++
 src/bin/pg_dump/pg_dump.c            | 44 ++++++++++++++++++++--------
 4 files changed, 81 insertions(+), 14 deletions(-)

diff --git a/src/bin/pg_dump/pg_backup.h b/src/bin/pg_dump/pg_backup.h
index 658986de6f8..781f8fa1cc9 100644
--- a/src/bin/pg_dump/pg_backup.h
+++ b/src/bin/pg_dump/pg_backup.h
@@ -285,6 +285,7 @@ typedef int DumpId;
  * Function pointer prototypes for assorted callback methods.
  */
 
+typedef char *(*DefnDumperPtr) (Archive *AH, const void *userArg);
 typedef int (*DataDumperPtr) (Archive *AH, const void *userArg);
 typedef void (*SetupWorkerPtrType) (Archive *AH);
 
diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 1d131e5a57d..334b5dedfd7 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -1266,6 +1266,9 @@ ArchiveEntry(Archive *AHX, CatalogId catalogId, DumpId dumpId,
 	newToc->dataDumperArg = opts->dumpArg;
 	newToc->hadDumper = opts->dumpFn ? true : false;
 
+	newToc->defnDumper = opts->defnFn;
+	newToc->defnDumperArg = opts->defnArg;
+
 	newToc->formatData = NULL;
 	newToc->dataLength = 0;
 
@@ -2621,7 +2624,33 @@ WriteToc(ArchiveHandle *AH)
 		WriteStr(AH, te->tag);
 		WriteStr(AH, te->desc);
 		WriteInt(AH, te->section);
-		WriteStr(AH, te->defn);
+
+		if (te->defnLen)
+		{
+			/*
+			 * We only set defnLen when a definition is generated by the
+			 * defnDumper during WriteToc(), so this must be a second
+			 * WriteToc() pass.  The defnDumper might execute queries, and
+			 * while running the same queries twice should in theory work
+			 * fine, it's expensive and feels risky.  So, we just seek through
+			 * those entries.  Presently, the only time we do a second
+			 * WriteToc() pass is for custom-format dumps when we've already
+			 * verified fseeko() works, so we can use it without checking.
+			 * We'll need to figure out something else if this changes.
+			 */
+			if (fseeko(AH->FH, te->defnLen, SEEK_CUR) != 0)
+				pg_fatal("error during file seek: %m");
+		}
+		else if (te->defnDumper)
+		{
+			char	   *defn = te->defnDumper((Archive *) AH, te->defnDumperArg);
+
+			te->defnLen = WriteStr(AH, defn);
+			pg_free(defn);
+		}
+		else
+			WriteStr(AH, te->defn);
+
 		WriteStr(AH, te->dropStmt);
 		WriteStr(AH, te->copyStmt);
 		WriteStr(AH, te->namespace);
@@ -3849,7 +3878,7 @@ _printTocEntry(ArchiveHandle *AH, TocEntry *te, const char *pfx)
 
 	/*
 	 * Actually print the definition.  Normally we can just print the defn
-	 * string if any, but we have three special cases:
+	 * string if any, but we have four special cases:
 	 *
 	 * 1. A crude hack for suppressing AUTHORIZATION clause that old pg_dump
 	 * versions put into CREATE SCHEMA.  Don't mutate the variant for schema
@@ -3862,6 +3891,10 @@ _printTocEntry(ArchiveHandle *AH, TocEntry *te, const char *pfx)
 	 * 3. ACL LARGE OBJECTS entries need special processing because they
 	 * contain only one copy of the ACL GRANT/REVOKE commands, which we must
 	 * apply to each large object listed in the associated BLOB METADATA.
+	 *
+	 * 4. Entries with a defnDumper need to call it to generate the
+	 * definition.  This is primarily intended to provide a way to save memory
+	 * for objects that need a lot of it (e.g., statistics data).
 	 */
 	if (ropt->noOwner &&
 		strcmp(te->desc, "SCHEMA") == 0 && strncmp(te->defn, "--", 2) != 0)
@@ -3877,6 +3910,13 @@ _printTocEntry(ArchiveHandle *AH, TocEntry *te, const char *pfx)
 	{
 		IssueACLPerBlob(AH, te);
 	}
+	else if (te->defnDumper)
+	{
+		char	   *defn = te->defnDumper((Archive *) AH, te->defnDumperArg);
+
+		ahprintf(AH, "%s\n\n", defn);
+		pg_free(defn);
+	}
 	else if (te->defn && strlen(te->defn) > 0)
 	{
 		ahprintf(AH, "%s\n\n", te->defn);
diff --git a/src/bin/pg_dump/pg_backup_archiver.h b/src/bin/pg_dump/pg_backup_archiver.h
index a2064f471ed..b7ebc2b39cd 100644
--- a/src/bin/pg_dump/pg_backup_archiver.h
+++ b/src/bin/pg_dump/pg_backup_archiver.h
@@ -368,6 +368,10 @@ struct _tocEntry
 	const void *dataDumperArg;	/* Arg for above routine */
 	void	   *formatData;		/* TOC Entry data specific to file format */
 
+	DefnDumperPtr defnDumper;	/* routine to dump definition statement */
+	const void *defnDumperArg;	/* arg for above routine */
+	size_t		defnLen;		/* length of dumped definition */
+
 	/* working state while dumping/restoring */
 	pgoff_t		dataLength;		/* item's data size; 0 if none or unknown */
 	int			reqs;			/* do we need schema and/or data of object
@@ -407,6 +411,8 @@ typedef struct _archiveOpts
 	int			nDeps;
 	DataDumperPtr dumpFn;
 	const void *dumpArg;
+	DefnDumperPtr defnFn;
+	const void *defnArg;
 } ArchiveOpts;
 #define ARCHIVE_OPTS(...) &(ArchiveOpts){__VA_ARGS__}
 /* Called to add a TOC entry */
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 4ca34be230c..46fb70e0a8b 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -10560,13 +10560,16 @@ appendNamedArgument(PQExpBuffer out, Archive *fout, const char *argname,
 }
 
 /*
- * dumpRelationStats --
+ * dumpRelationStats_dumper --
  *
- * Dump command to import stats into the relation on the new database.
+ * Generate command to import stats into the relation on the new database.
+ * This routine is called by the Archiver when it wants the statistics to be
+ * dumped.
  */
-static void
-dumpRelationStats(Archive *fout, const RelStatsInfo *rsinfo)
+static char *
+dumpRelationStats_dumper(Archive *fout, const void *userArg)
 {
+	const RelStatsInfo *rsinfo = (RelStatsInfo *) userArg;
 	const DumpableObject *dobj = &rsinfo->dobj;
 	PGresult   *res;
 	PQExpBuffer query;
@@ -10586,10 +10589,7 @@ dumpRelationStats(Archive *fout, const RelStatsInfo *rsinfo)
 	int			i_range_length_histogram;
 	int			i_range_empty_frac;
 	int			i_range_bounds_histogram;
-
-	/* nothing to do if we are not dumping statistics */
-	if (!fout->dopt->dumpStatistics)
-		return;
+	char	   *ret;
 
 	query = createPQExpBuffer();
 	if (!fout->is_prepared[PREPQUERY_GETATTRIBUTESTATS])
@@ -10770,17 +10770,37 @@ dumpRelationStats(Archive *fout, const RelStatsInfo *rsinfo)
 
 	PQclear(res);
 
+	destroyPQExpBuffer(query);
+	ret = out->data;
+	pg_free(out);
+	return ret;
+}
+
+/*
+ * dumpRelationStats --
+ *
+ * Make an ArchiveEntry for the relation statistics.  The Archiver will take
+ * care of gathering the statistics and generating the restore commands when
+ * they are needed.
+ */
+static void
+dumpRelationStats(Archive *fout, const RelStatsInfo *rsinfo)
+{
+	const DumpableObject *dobj = &rsinfo->dobj;
+
+	/* nothing to do if we are not dumping statistics */
+	if (!fout->dopt->dumpStatistics)
+		return;
+
 	ArchiveEntry(fout, nilCatalogId, createDumpId(),
 				 ARCHIVE_OPTS(.tag = dobj->name,
 							  .namespace = dobj->namespace->dobj.name,
 							  .description = "STATISTICS DATA",
 							  .section = rsinfo->section,
-							  .createStmt = out->data,
+							  .defnFn = dumpRelationStats_dumper,
+							  .defnArg = rsinfo,
 							  .deps = dobj->dependencies,
 							  .nDeps = dobj->nDeps));
-
-	destroyPQExpBuffer(out);
-	destroyPQExpBuffer(query);
 }
 
 /*
-- 
2.39.5 (Apple Git-154)
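As a distilled illustration of what 0002 changes (a standalone sketch, not
pg_dump's actual types or code): the TOC entry carries a callback instead
of a pre-built definition string, the string is generated only at write
time, and it is freed immediately afterwards, so only one definition is
ever held in memory at once:

    #include <stdio.h>
    #include <stdlib.h>

    typedef char *(*defn_dumper) (const void *arg);

    typedef struct
    {
        const char *tag;
        char       *defn;       /* pre-built definition, or NULL */
        defn_dumper defn_fn;    /* callback to build it on demand */
        const void *defn_arg;
    } toc_entry;

    /* Hypothetical callback: builds a (potentially huge) command lazily. */
    static char *
    stats_defn(const void *arg)
    {
        const char *relname = arg;
        char       *buf = malloc(1024);

        snprintf(buf, 1024, "-- statistics commands for %s would go here\n",
                 relname);
        return buf;
    }

    static void
    write_entry(FILE *out, const toc_entry *te)
    {
        if (te->defn_fn)
        {
            /* Generate just in time, write it out, and free it right away. */
            char       *defn = te->defn_fn(te->defn_arg);

            fputs(defn, out);
            free(defn);
        }
        else if (te->defn)
            fputs(te->defn, out);
    }

    int
    main(void)
    {
        toc_entry   stats = {"STATISTICS DATA", NULL, stats_defn, "public.foo"};

        write_entry(stdout, &stats);
        return 0;
    }

With the written length recorded at write time (defnLen in the real patch),
the custom format's second pass can then skip over the entry instead of
regenerating it, which is the part 0002 handles in WriteToc().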
>From 299c7b6d0f5b009d42ee5a4a28278ab3c1bc725c Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nat...@postgresql.org>
Date: Tue, 1 Apr 2025 21:00:24 -0500
Subject: [PATCH v12n4 3/3] pg_dump: Retrieve attribute statistics in batches.

Currently, pg_dump gathers attribute statistics with a query per
relation, which can cause pg_dump to take significantly longer,
especially when there are many tables.  This commit improves matters by
gathering attribute statistics for 64 relations at a time.  Some simple
testing showed this was the ideal batch size, but performance may vary
depending on workload.

This change increases the memory usage of pg_dump a bit, but that isn't
expected to be too egregious and is arguably well worth the trade-off.

Author: Corey Huinker <corey.huin...@gmail.com>
Reviewed-by: Jeff Davis <pg...@j-davis.com>
Discussion: https://postgr.es/m/CADkLM%3Dc%2Br05srPy9w%2B-%2BnbmLEo15dKXYQ03Q_xyK%2BriJerigLQ%40mail.gmail.com
---
 src/bin/pg_dump/pg_dump.c | 104 +++++++++++++++++++++++++++++++-------
 1 file changed, 86 insertions(+), 18 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 46fb70e0a8b..9d60c77865f 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -209,6 +209,9 @@ static int	nbinaryUpgradeClassOids = 0;
 static SequenceItem *sequences = NULL;
 static int	nsequences = 0;
 
+/* Maximum number of relations to fetch in a fetchAttributeStats() call. */
+#define MAX_ATTR_STATS_RELS 64
+
 /*
  * The default number of rows per INSERT when
  * --inserts is specified without --rows-per-insert
@@ -10559,6 +10562,63 @@ appendNamedArgument(PQExpBuffer out, Archive *fout, const char *argname,
 	appendPQExpBuffer(out, "::%s", argtype);
 }
 
+/*
+ * fetchAttributeStats --
+ *
+ * Fetch next batch of rows for getAttributeStats().
+ */
+static PGresult *
+fetchAttributeStats(Archive *fout)
+{
+	ArchiveHandle *AH = (ArchiveHandle *) fout;
+	PQExpBuffer nspnames = createPQExpBuffer();
+	PQExpBuffer relnames = createPQExpBuffer();
+	int			count = 0;
+	PGresult   *res = NULL;
+	static TocEntry *te;
+
+	/* If we're just starting, set our TOC pointer. */
+	if (!te)
+		te = AH->toc->next;
+
+	/*
+	 * Scan the TOC for the next set of relevant stats entries.  We assume
+	 * that statistics are dumped in the order they are listed in the TOC.
+	 * This is perhaps not the sturdiest assumption, but it appears to
+	 * presently be true in all cases.
+	 */
+	for (; te != AH->toc && count < MAX_ATTR_STATS_RELS; te = te->next)
+	{
+		if (te->reqs && strcmp(te->desc, "STATISTICS DATA") == 0)
+		{
+			RelStatsInfo *rsinfo = (RelStatsInfo *) te->defnDumperArg;
+
+			appendPQExpBuffer(nspnames, "%s%s", count ? "," : "",
+							  fmtId(rsinfo->dobj.namespace->dobj.name));
+			appendPQExpBuffer(relnames, "%s%s", count ? "," : "",
+							  fmtId(rsinfo->dobj.name));
+			count++;
+		}
+	}
+
+	/* Execute the query for the next batch of relations. */
+	if (count > 0)
+	{
+		PQExpBuffer query = createPQExpBuffer();
+
+		appendPQExpBuffer(query, "EXECUTE getAttributeStats("
+						  "'{%s}'::pg_catalog.name[],"
+						  "'{%s}'::pg_catalog.name[])",
+						  nspnames->data, relnames->data);
+		res = ExecuteSqlQuery(fout, query->data, PGRES_TUPLES_OK);
+		destroyPQExpBuffer(query);
+	}
+
+	destroyPQExpBuffer(nspnames);
+	destroyPQExpBuffer(relnames);
+	return res;
+}
+
 /*
  * dumpRelationStats_dumper --
  *
@@ -10571,9 +10631,12 @@ dumpRelationStats_dumper(Archive *fout, const void *userArg)
 {
 	const RelStatsInfo *rsinfo = (RelStatsInfo *) userArg;
 	const DumpableObject *dobj = &rsinfo->dobj;
-	PGresult   *res;
+	static PGresult *res;
+	static int	rownum;
 	PQExpBuffer query;
 	PQExpBuffer out;
+	int			i_schemaname;
+	int			i_tablename;
 	int			i_attname;
 	int			i_inherited;
 	int			i_null_frac;
@@ -10595,8 +10658,8 @@ dumpRelationStats_dumper(Archive *fout, const void *userArg)
 	if (!fout->is_prepared[PREPQUERY_GETATTRIBUTESTATS])
 	{
 		appendPQExpBufferStr(query,
-							 "PREPARE getAttributeStats(pg_catalog.name, pg_catalog.name) AS\n"
-							 "SELECT s.attname, s.inherited, "
+							 "PREPARE getAttributeStats(pg_catalog.name[], pg_catalog.name[]) AS\n"
+							 "SELECT s.schemaname, s.tablename, s.attname, s.inherited, "
							 "s.null_frac, s.avg_width, s.n_distinct, "
 							 "s.most_common_vals, s.most_common_freqs, "
 							 "s.histogram_bounds, s.correlation, "
@@ -10616,9 +10679,11 @@ dumpRelationStats_dumper(Archive *fout, const void *userArg)
 
 		appendPQExpBufferStr(query,
 							 "FROM pg_catalog.pg_stats s "
-							 "WHERE s.schemaname = $1 "
-							 "AND s.tablename = $2 "
-							 "ORDER BY s.attname, s.inherited");
+							 "JOIN unnest($1, $2) WITH ORDINALITY AS u (schemaname, tablename, ord) "
+							 "ON s.schemaname = u.schemaname "
+							 "AND s.tablename = u.tablename "
+							 "WHERE s.tablename = ANY($2) "
+							 "ORDER BY u.ord, s.attname, s.inherited");
 
 		ExecuteSqlStatement(fout, query->data);
 
@@ -10648,16 +10713,16 @@ dumpRelationStats_dumper(Archive *fout, const void *userArg)
 
 	appendPQExpBufferStr(out, "\n);\n");
 
+	/* Fetch the next batch of attribute statistics if needed. */
+	if (rownum >= PQntuples(res))
+	{
+		PQclear(res);
+		res = fetchAttributeStats(fout);
+		rownum = 0;
+	}
 
-	/* fetch attribute stats */
-	appendPQExpBufferStr(query, "EXECUTE getAttributeStats(");
-	appendStringLiteralAH(query, dobj->namespace->dobj.name, fout);
-	appendPQExpBufferStr(query, ", ");
-	appendStringLiteralAH(query, dobj->name, fout);
-	appendPQExpBufferStr(query, ");");
-
-	res = ExecuteSqlQuery(fout, query->data, PGRES_TUPLES_OK);
-
+	i_schemaname = PQfnumber(res, "schemaname");
+	i_tablename = PQfnumber(res, "tablename");
 	i_attname = PQfnumber(res, "attname");
 	i_inherited = PQfnumber(res, "inherited");
 	i_null_frac = PQfnumber(res, "null_frac");
@@ -10675,10 +10740,15 @@ dumpRelationStats_dumper(Archive *fout, const void *userArg)
 	i_range_bounds_histogram = PQfnumber(res, "range_bounds_histogram");
 
 	/* restore attribute stats */
-	for (int rownum = 0; rownum < PQntuples(res); rownum++)
+	for (; rownum < PQntuples(res); rownum++)
 	{
 		const char *attname;
 
+		/* Stop if the next stat row in our cache isn't for this relation. */
+		if (strcmp(dobj->name, PQgetvalue(res, rownum, i_tablename)) != 0 ||
+			strcmp(dobj->namespace->dobj.name, PQgetvalue(res, rownum, i_schemaname)) != 0)
+			break;
+
 		appendPQExpBufferStr(out, "SELECT * FROM pg_catalog.pg_restore_attribute_stats(\n");
 		appendPQExpBuffer(out, "\t'version', '%u'::integer,\n",
 						  fout->remoteVersion);
@@ -10768,8 +10838,6 @@ dumpRelationStats_dumper(Archive *fout, const void *userArg)
 		appendPQExpBufferStr(out, "\n);\n");
 	}
 
-	PQclear(res);
-
 	destroyPQExpBuffer(query);
 	ret = out->data;
 	pg_free(out);
-- 
2.39.5 (Apple Git-154)