Hello list,

I'm submitting a patch that addresses a pause of almost one hour at the start of a parallel pg_restore of a big archive. Related discussion has taken place on the pgsql-performance mailing list at:
https://www.postgresql.org/message-id/flat/6bd16bdb-aa5e-0512-739d-b84100596035%40gmx.net

I think I explain it rather well in the commit message, so I paste it inline:

Improve the performance of parallel pg_restore (-j) from a custom format pg_dump archive that does not include data offsets. This typically happens when pg_dump has generated the archive by writing to stdout instead of a file.

In this case pg_restore workers loop constantly, reading small sizes (4KB) and seeking forward small lengths (around 10KB for a compressed archive):

read(4, "..."..., 4096) = 4096
lseek(4, 55544369152, SEEK_SET) = 55544369152
read(4, "..."..., 4096) = 4096
lseek(4, 55544381440, SEEK_SET) = 55544381440
read(4, "..."..., 4096) = 4096
lseek(4, 55544397824, SEEK_SET) = 55544397824
read(4, "..."..., 4096) = 4096
lseek(4, 55544414208, SEEK_SET) = 55544414208
read(4, "..."..., 4096) = 4096
lseek(4, 55544426496, SEEK_SET) = 55544426496

This happens because each worker scans the whole file until it finds the entry it wants, skipping forward over each block. In combination with the small block size of the custom format dump, this causes many seeks and low performance.

Fix by avoiding forward seeks for jumps of less than 1MB. Do sequential reads instead.

The performance gain can be significant, depending on the size of the dump and the I/O subsystem. On my local NVMe drive, read speeds for that phase of pg_restore increased from 150MB/s to 3GB/s.

This is my first patch submission, so all help is much appreciated.

Regards,
Dimitris

P.S. What is the recommended way to test a change, besides a generic make check? And how do I selectively run only the pg_dump/pg_restore tests, in order to speed up my development routine?
From 5a43ce169a8b03a866728f2af8117871203f395a Mon Sep 17 00:00:00 2001
From: Dimitrios Apostolou <ji...@qt.io>
Date: Sat, 29 Mar 2025 01:16:07 +0100
Subject: [PATCH v1] parallel pg_restore: avoid disk seeks when jumping short
 distance forward

Improve the performance of parallel pg_restore (-j) from a custom format
pg_dump archive that does not include data offsets. This typically
happens when pg_dump has generated the archive by writing to stdout
instead of a file.

In this case pg_restore workers loop constantly, reading small sizes
(4KB) and seeking forward small lengths (around 10KB for a compressed
archive):

read(4, "..."..., 4096) = 4096
lseek(4, 55544369152, SEEK_SET) = 55544369152
read(4, "..."..., 4096) = 4096
lseek(4, 55544381440, SEEK_SET) = 55544381440
read(4, "..."..., 4096) = 4096
lseek(4, 55544397824, SEEK_SET) = 55544397824
read(4, "..."..., 4096) = 4096
lseek(4, 55544414208, SEEK_SET) = 55544414208
read(4, "..."..., 4096) = 4096
lseek(4, 55544426496, SEEK_SET) = 55544426496

This happens because each worker scans the whole file until it finds the
entry it wants, skipping forward over each block. In combination with
the small block size of the custom format dump, this causes many seeks
and low performance.

Fix by avoiding forward seeks for jumps of less than 1MB. Do sequential
reads instead.

The performance gain can be significant, depending on the size of the
dump and the I/O subsystem. On my local NVMe drive, read speeds for that
phase of pg_restore increased from 150MB/s to 3GB/s.
---
 src/bin/pg_dump/pg_backup_custom.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/src/bin/pg_dump/pg_backup_custom.c b/src/bin/pg_dump/pg_backup_custom.c
index e44b887eb29..d32c9c64d23 100644
--- a/src/bin/pg_dump/pg_backup_custom.c
+++ b/src/bin/pg_dump/pg_backup_custom.c
@@ -623,19 +623,21 @@ _skipData(ArchiveHandle *AH)
 {
 	lclContext *ctx = (lclContext *) AH->formatData;
 	size_t		blkLen;
 	char	   *buf = NULL;
 	int			buflen = 0;
 
 	blkLen = ReadInt(AH);
 	while (blkLen != 0)
 	{
-		if (ctx->hasSeek)
+		/* Sequential access is usually faster, so avoid seeking if the
+		 * jump forward is shorter than 1MB. */
+		if (ctx->hasSeek && blkLen > 1024 * 1024)
 		{
 			if (fseeko(AH->FH, blkLen, SEEK_CUR) != 0)
 				pg_fatal("error during file seek: %m");
 		}
 		else
 		{
 			if (blkLen > buflen)
 			{
 				free(buf);
-- 
2.48.1
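
To try the idea outside the PostgreSQL tree, here is a minimal standalone sketch of the same decision: seek only when the forward jump exceeds a cutoff, otherwise read and discard the bytes. It is not the actual _skipData code. The names skip_forward, has_seek and threshold are hypothetical and exist only for this illustration; the patch itself hard-codes the cutoff as 1024 * 1024.

```c
#define _POSIX_C_SOURCE 200809L	/* for fseeko/ftello under strict modes */
#define _FILE_OFFSET_BITS 64	/* 64-bit off_t on 32-bit systems */

#include <stdbool.h>
#include <stdio.h>
#include <sys/types.h>

/*
 * Skip len bytes forward in fp.  Returns true on success.
 *
 * Hypothetical helper, not part of pg_dump: it seeks only when the file is
 * seekable and the jump is longer than threshold; otherwise it reads and
 * discards the data so that the access pattern stays sequential.
 */
static bool
skip_forward(FILE *fp, size_t len, bool has_seek, size_t threshold)
{
	char		buf[8192];

	/* Long jump on a seekable file: one seek is cheaper than reading. */
	if (has_seek && len > threshold)
		return fseeko(fp, (off_t) len, SEEK_CUR) == 0;

	/* Short jump (or no seek support): consume the bytes sequentially. */
	while (len > 0)
	{
		size_t		want = len < sizeof(buf) ? len : sizeof(buf);
		size_t		got = fread(buf, 1, want, fp);

		if (got == 0)
			return false;		/* EOF or read error before len bytes */
		len -= got;
	}
	return true;
}
```

Calling skip_forward(fp, blkLen, true, 1024 * 1024) would correspond to the cutoff chosen in the patch. Passing the threshold as a parameter is only a convenience for experimenting with other values on a given I/O subsystem.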