> Can you expand on some of those cases?

Certainly. I think one of the problems is that because this patch is
solving a pg_upgrade issue, the focus is on the "dump and restore"
scenarios. But pg_dump is used for much more than that, especially "dump
and examine".

Although pg_dump is meant to be a canonical, logical representation of your
schema and data, the stats add a non-deterministic element to that.
Statistical sampling is random, so pg_dump output changes with each run.
(Yes, COPY output can also change, but much less so, as I argue later.)

One use case is a program that is simply using pg_dump to verify that
nothing has modified your table data (I'll use a single table for these
examples, but obviously this applies to a whole database as well). So let's
say we create a table and populate it at time X, then check back at a later
time to verify things are still exactly as we left them.

dropdb gregtest
createdb gregtest
pgbench gregtest -i 2> /dev/null
pg_dump gregtest -t pgbench_accounts > a1
sleep 10
pg_dump gregtest -t pgbench_accounts > a2
diff a1 a2 | cut -c1-50

100078c100078
<       'histogram_bounds', '{2,964,1921,2917,3892,4935
---
>       'histogram_bounds', '{7,989,1990,2969,3973,4977

While COPY is not going to promise a particular output order, the order
should not change except after explicit operations: insert, update, delete,
truncate, vacuum full, cluster (off the top of my head). What should not
change the output is a background process gathering some metadata, or
someone running a database-wide ANALYZE.


Another use case is someone rolling out their schema to a QA box. All the
table definitions and data are checked into a git repository, with a
checksum. They want to roll it out, and then verify that everything is
exactly as they expect it to be. Or the program is part of a test suite
that does a sanity check that the database is in an exact known state
before starting.

(Our system catalogs are very difficult to work with when reverse
engineering objects. Thus, many programs rely on pg_dump to do the heavy
lifting for them: parsing the text file generated by pg_dump is much easier
than manipulating the system catalogs directly.)

So let's say the process is to create a new database, load things into it,
and then checksum the result. We can simulate that with pg_bench:

dropdb qa1; dropdb qa2
createdb qa1; createdb qa2
pgbench qa1 -i 2>/dev/null
pgbench qa2 -i 2>/dev/null
pg_dump qa1 > dump1; pg_dump qa2 > dump2

$ md5sum dump1
39a2da5e51e8541e9a2c025c918bf463  dump1

This md5sum does not match our repo! It doesn't even match the other one:

$ md5sum dump2
4a977657dfdf910cb66c875d29cfebf2  dump2

It's the stats, of course, which have added a dose of randomness that was
not there before and made our checksums useless:

$ diff dump1 dump2 | cut -c1-50
100172c100172
<       'histogram_bounds', '{1,979,1974,2952,3973,4900
---
>       'histogram_bounds', '{8,1017,2054,3034,4045,513

With --no-statistics, the diff shows no difference, and the md5sum is
always the same.
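To make the checksum failure concrete without a live cluster, here is a sketch with two fabricated dump files; the single differing statistics line is enough to change the whole md5sum, while removing it (which is effectively what --no-statistics does at dump time) restores determinism:

```shell
# Two fabricated dumps that differ only in a sampled statistics line
printf "CREATE TABLE t (a int);\n'histogram_bounds', '{1,979,1974}'\n" > dump1
printf "CREATE TABLE t (a int);\n'histogram_bounds', '{8,1017,2054}'\n" > dump2

md5sum dump1 dump2   # one differing line changes the entire checksum

# With the statistics line removed, both files hash identically
grep -v histogram_bounds dump1 | md5sum
grep -v histogram_bounds dump2 | md5sum
```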

Just to be clear, I love this patch, and I love the fact that one of our
major upgrade warts is finally getting fixed. I've tried fixing it myself a
few times over the last decade or so, but lacked the skills to do so. :) So
I am thrilled to have this finally done. I just don't think it should be
enabled by default for everything using pg_dump. For the record, I would
not strongly object to having stats on by default for binary dumps,
although I would prefer them off.

So why not just expect people to modify their programs to use
--no-statistics for cases like this? That's certainly an option, but it's
going to break a lot of existing things, and create branching code:

old code:

pg_dump mydb -f pg.dump

new code:

version=$(pg_dump --version | awk '{print $3}' | cut -d. -f1)
if [ "$version" -ge 18 ]; then
  pg_dump --no-statistics mydb -f pg.dump
else
  pg_dump mydb -f pg.dump
fi

Also, anything trained to parse pg_dump output will have to learn about the
new SELECT pg_restore_ calls with their multi-line format. (I'm not 100%
sure nothing out there assumes single-line statements; things like "SELECT
setval" and "SELECT set_config" are single lines, but there may be existing
tools that depend on that.)
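To illustrate the parsing difference, here is a hedged sketch; the pg_restore_relation_stats call below is paraphrased from the patch's output style and may not match the real format exactly:

```shell
# Single-line statements, the shape existing line-oriented parsers expect
printf "SELECT pg_catalog.setval('pgbench_accounts_aid_seq', 100000, true);\n" > old.sql

# A statistics call spans several lines, so "one line = one statement" breaks
cat > new.sql <<'EOF'
SELECT * FROM pg_catalog.pg_restore_relation_stats(
    'relation', 'public.pgbench_accounts'::regclass,
    'relpages', 1641::integer
);
EOF

grep -c ';$' old.sql   # one terminator line per statement
wc -l < new.sql        # four lines for a single statement
```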


Cheers,
Greg

--
Crunchy Data - https://www.crunchydata.com
Enterprise Postgres Software Products & Tech Support
