Hello, I would greatly appreciate your insight into an issue where pg_restore runs significantly slower than expected, even though the underlying storage shows very high random write throughput. I am trying to understand which PostgreSQL mechanisms or system layers I should investigate next in order to pinpoint the bottleneck and improve restore performance.
The central question is: what should I examine further to understand why checkpoint processing becomes the dominant bottleneck during restore, despite fsync=off, synchronous_commit=off, and excellent random write latency? Below is a detailed description of the environment, the behavior observed, the steps I have already taken, and the research performed so far.

During pg_restore, execution time remains extremely long: around 2+ hours using a custom-format dump and over 4 hours using directory format. The machine consistently demonstrates high random write performance (median latency ~5 ms, ~45k random write IOPS), yet the PostgreSQL logs show very long checkpoints in which the write phase dominates (hundreds to thousands of seconds). Checkpoints appear to stall the entire restore process.

I have tested multiple combinations of dump formats (custom and directory) and parallel jobs (-j 1, 12, 18). The restore duration barely changes, which strongly suggests that the bottleneck is not client-side parallelism but internal server behavior, specifically the checkpoint write phase. Example log excerpts show checkpoint write times consistently in the range of 600–1100 seconds, with large numbers of buffers written (from hundreds of thousands to over 1.6 million). Sync times remain negligible because fsync is disabled, reinforcing the suspicion that PostgreSQL's internal buffer flushing and write throttling mechanisms are the source of the slowdown, not WAL or filesystem sync.

Given that:

* storage is fast,
* fsync and synchronous commits are disabled,
* full_page_writes is off,
* wal_level is minimal,
* autovacuum is off,
* the restore is the only workload,

I am trying to determine which further PostgreSQL internals or Linux I/O mechanisms could explain why these checkpoints take orders of magnitude longer than the device's raw write characteristics would suggest.

I am particularly looking for guidance on:

* whether backend or checkpointer write throttling may still be limiting write concurrency even during bulk restore (see the diagnostic sketch after this list),
* whether XFS on Hyper-V VHDX + LVM + a battery-backed SSD array could introduce serialization that is invisible to raw I/O tests,
* whether certain parameters (e.g., effective_io_concurrency, maintenance_io_concurrency, wal_writer settings, the io_combine limits, io_uring behavior) could unintentionally reduce write throughput,
* whether parallel pg_restore is inherently constrained by global buffer flushing behavior,
* any other PostgreSQL mechanisms that could cause prolonged checkpoint write durations even with crash safety disabled.
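As a concrete starting point for the throttling question, this is the kind of diagnostic I am planning to run during the next restore attempt. It is only a sketch: it assumes pg_stat_checkpointer and pg_stat_io report what I think they report in PostgreSQL 18, and the write_time columns are only meaningful with track_io_timing enabled. Corrections welcome if there is a better way to observe checkpointer versus backend writes.

    -- Flush/throttling-related GUCs that might still gate writes even with fsync off
    SELECT name, setting, unit
    FROM pg_settings
    WHERE name IN ('checkpoint_flush_after', 'backend_flush_after',
                   'bgwriter_flush_after', 'wal_writer_flush_after',
                   'bgwriter_delay', 'bgwriter_lru_maxpages');

    -- Cumulative checkpointer activity: write vs. sync time, buffers written
    SELECT num_timed, num_requested, write_time, sync_time, buffers_written
    FROM pg_stat_checkpointer;

    -- Which processes actually issue the writes: checkpointer, bgwriter,
    -- or the restore backends themselves
    SELECT backend_type, object, context, writes, write_time, fsyncs
    FROM pg_stat_io
    WHERE writes > 0
    ORDER BY writes DESC;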
Below are the configuration values and environment details referenced above.

Machine:

* Hyper-V VM, 24 vCPU, 80 GB RAM
* Ubuntu 24.04.3 (kernel 6.8.0-88)
* PostgreSQL 18.1
* Database size: ~700 GB across two tablespaces on separate disks (freshly restored)

Storage layout:

* Each disk is its own VHDX
* LVM on a battery-backed SSD array
* XFS for PGDATA
* Barriers disabled

Random write performance (steady state):

* Median latency: 5.1 ms
* IOPS: ~45.6k

Restore tests:

* pg_restore custom format: ~2h+
* pg_restore directory format: ~4h+
* Parallelism tested with -j 1, 12, 18, 24

Representative checkpoint log entries: write phases ranging from 76 to 1079 seconds, buffer writes up to 1.6M.

postgresql.conf (relevant parts):

    shared_buffers = 20GB
    work_mem = 150MB
    maintenance_work_mem = 8GB
    effective_io_concurrency = 1
    maintenance_io_concurrency = 1
    io_max_combine_limit = 512kB
    io_combine_limit = 1024kB
    io_method = io_uring
    fsync = off
    synchronous_commit = off
    wal_sync_method = fdatasync
    full_page_writes = off
    wal_compression = lz4
    checkpoint_timeout = 60min
    checkpoint_completion_target = 0.9
    max_wal_size = 80GB
    min_wal_size = 10GB
    effective_cache_size = 65GB
    autovacuum = off
    max_locks_per_transaction = 256

If anyone has encountered similar behavior or can recommend specific PostgreSQL subsystems, kernel settings, or I/O patterns worth investigating, I would be very grateful for advice. My main goal is to understand why checkpoint writes are so slow relative to the hardware's demonstrated capabilities, and how to safely accelerate the restore workflow. Thank you in advance for any guidance.
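P.S. For completeness, these are the restore-only experiments I am considering before the next run, expressed as ALTER SYSTEM statements. The specific values are guesses on my part rather than anything I have validated, so please flag anything that is misguided for a bulk-load workload.

    -- Disable checkpointer writeback throttling (sync_file_range hinting on Linux)
    ALTER SYSTEM SET checkpoint_flush_after = 0;
    -- Likewise for writes issued directly by the restoring backends
    ALTER SYSTEM SET backend_flush_after = 0;
    -- Raise read-ahead/prefetch concurrency from the current value of 1
    -- (possibly irrelevant to checkpoint writes, but cheap to test)
    ALTER SYSTEM SET effective_io_concurrency = 32;
    ALTER SYSTEM SET maintenance_io_concurrency = 32;
    -- Apply without a restart where the parameter allows it
    SELECT pg_reload_conf();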
