On Wed, Mar 29, 2023 at 8:28 PM Bharath Rupireddy <bharath.rupireddyforpostg...@gmail.com> wrote: > I took a look at v9 and LGTM.
Pushed, thanks. There is still an outstanding question around the overhead of outputting FPIs and even block data from pg_get_wal_block_info(). At one point Melanie suggested that we'd need to do something about that, and I tend to agree. Attached patch provides an optional parameter that will make pg_get_wal_block_info return NULLs for both block_data and block_fpi_data, no matter whether or not there is something to show. Note that this only affects those two bytea columns; we'll still show everything else, including valid block_data_length and block_fpi_length values (so the metadata describing the on-disk size of block_data and block_fpi_data is unaffected). To test this patch, I ran pgbench for about 5 minutes, using a fairly standard configuration with added indexes and with wal_log_hints enabled. I ended up with the following WAL records afterwards: pg@regression:5432 [1402115]=# SELECT "resource_manager/record_type" t, pg_size_pretty(combined_size) s, fpi_size_percentage perc_fpi FROM pg_get_wal_Stats ('0/10E9D80', 'FFFFFFFF/FFFFFFFF', FALSE) where combined_size > 0; ┌─[ RECORD 1 ]──────────────────┐ │ t │ XLOG │ │ s │ 1557 MB │ │ perc_fpi │ 22.029466865781302 │ ├─[ RECORD 2 ]──────────────────┤ │ t │ Transaction │ │ s │ 49 MB │ │ perc_fpi │ 0 │ ├─[ RECORD 3 ]──────────────────┤ │ t │ Storage │ │ s │ 13 kB │ │ perc_fpi │ 0 │ ├─[ RECORD 4 ]──────────────────┤ │ t │ CLOG │ │ s │ 1380 bytes │ │ perc_fpi │ 0 │ ├─[ RECORD 5 ]──────────────────┤ │ t │ Database │ │ s │ 118 bytes │ │ perc_fpi │ 0 │ ├─[ RECORD 6 ]──────────────────┤ │ t │ RelMap │ │ s │ 565 bytes │ │ perc_fpi │ 0 │ ├─[ RECORD 7 ]──────────────────┤ │ t │ Standby │ │ s │ 30 kB │ │ perc_fpi │ 0 │ ├─[ RECORD 8 ]──────────────────┤ │ t │ Heap2 │ │ s │ 4235 MB │ │ perc_fpi │ 0.6731388657682449 │ ├─[ RECORD 9 ]──────────────────┤ │ t │ Heap │ │ s │ 4482 MB │ │ perc_fpi │ 54.46811493602934 │ ├─[ RECORD 10 ]─────────────────┤ │ t │ Btree │ │ s │ 1786 MB │ │ perc_fpi │ 22.829279332421116 │ └──────────┴────────────────────┘ Time: 3618.693 ms (00:03.619) So about 12GB of WAL -- certainly enough to be a challenge for pg_walinspect. I then ran the following query several times over the same LSN range as before, with my patch applied, but with behavior equivalent to current git HEAD (this is with outputting block_data and block_fpi_data values still turned on): pg@regression:5432 [1402115]=# SELECT count(*) FROM pg_get_wal_block_info ('0/10E9D80', 'FFFFFFFF/FFFFFFFF', false); ┌─[ RECORD 1 ]───────┐ │ count │ 17,031,979 │ └───────┴────────────┘ Time: 35171.463 ms (00:35.171) The time shown here is typical of what I saw. And now the same query, but without any overhead for outputting block_data and block_fpi_data values: pg@regression:5432 [1402115]=# SELECT count(*) FROM pg_get_wal_block_info ('0/10E9D80', 'FFFFFFFF/FFFFFFFF', true); ┌─[ RECORD 1 ]───────┐ │ count │ 17,031,979 │ └───────┴────────────┘ Time: 15235.499 ms (00:15.235) This time is also typical of what I saw. The variance was fairly low, so I won't bother describing it. I think that this is a compelling reason to apply the patch. It would be possible to get about 75% of the benefit shown here by just suppressing block_fpi_data output, without suppressing block_data, but I think that it makes sense to either suppress both or neither. Things like page split records can write a fairly large amount of WAL in a way that resembles an FPI, even though technically no FPI is involved. If there are no objections, I'll move ahead with committing something along the lines of this patch in the next couple of days. -- Peter Geoghegan
v1-0001-pg_get_wal_block_info-suppress-block-data-outputs.patch
Description: Binary data