Hi,

Walsenders currently read WAL data from disk to send it to all replicas (standbys or subscribers connected via streaming or logical replication respectively). This means that walsenders have to wait until the WAL data is flushed to disk. There are a few issues with this approach:
1. IO saturation on the primary. The combined read IO of all walsenders can be huge, given the sheer number of walsenders typically present in production environments (e.g. for high availability, disaster recovery, read replicas, or subscribers) and their long lifetimes (one usually maintains replicas for a long period of time in production). For example, a quick 30-minute pgbench run with 1 primary, 1 async standby, and 1 sync standby shows the 2 walsenders together reading about 35 GB of WAL from disk on the primary, across about 3.3 million reads [3].
2. Increased query response times, particularly for synchronous standbys, because WAL flushes on the primary and on standbys usually happen at different times.
3. Increased replication lag, especially when WAL is read from disk even though it is still present in wal_buffers at the time.

To address these issues, I propose letting walsenders, whenever possible, send WAL directly from wal_buffers to replicas before it is flushed to disk. This idea is also noted elsewhere [1]. Standbys can choose to store the received WAL in wal_buffers (note that wal_buffers on standbys are allocated but not used until promotion) and flush when they are full, OR store the WAL directly to disk, bypassing wal_buffers, but replay only up to the flush LSN sent by the primary. Logical subscribers can choose not to apply WAL beyond the flush LSN sent by the primary.

This approach has the following advantages:
1. Reduces disk IO and read system calls on the primary.
2. Reduces replication lag.
3. Enables better use of the allocated wal_buffers on standbys.
4. Enables WAL to be flushed to disk in parallel on both the primary and standbys.
5. Disallows async standbys or subscribers from getting ahead of sync standbys, as discussed in the thread at [1], reducing the effort required during failovers.

This approach also has a few challenges:
1. Increased stress on wal_buffers - right now there are no readers of wal_buffers on the primary. This could be problematic if there are both many concurrent readers and many concurrent writers.
2. The wal_buffers hit ratio can be low for write-heavy workloads; in that case disk reads are inevitable.
3. It requires a change to the replication protocol. We might have to send the flush LSN to replicas and receive their flush LSNs as acknowledgements.
4. It requires careful design so that replicas do not replay beyond the received flush LSN. For example, what happens if wal_buffers fill up - should we write the WAL to disk? What happens if the primary or replicas crash? Will they have to fetch again the unwritten WAL that was lost from wal_buffers?

I would like to summarize the whole work as the following 3 independent items and focus on each of them individually:
1. Allow walsenders to read WAL directly from wal_buffers when possible - initial patches and results will be posted soon. This has its own advantages; the comment at [2] talks about them.
2. Allow WAL writes and flushes to disk to happen nearly at the same time on both the primary and standbys.
3. Disallow async standbys or subscribers from getting ahead of sync standbys.

Thoughts?
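To make the first item more concrete, here is a minimal, self-contained C sketch of the idea: try to copy the requested WAL range out of an in-memory ring of WAL pages, and signal the caller to fall back to a disk read if any needed page has already been recycled. All names here (wal_buffers_t, walbuf_read, lsn_to_slot) and the page size and buffer count are hypothetical simplifications for illustration; this is not the actual PostgreSQL wal_buffers code, which works on XLOG_BLCKSZ-sized pages under XLogCtl with its own locking and index mapping.

```c
/* Simplified sketch, NOT PostgreSQL source: a ring of WAL pages indexed
 * by LSN, with a read path that detects recycled (overwritten) pages. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define WAL_PAGE_SIZE 8192   /* illustrative; PostgreSQL uses XLOG_BLCKSZ */
#define N_WAL_BUFFERS 4      /* illustrative; the real count is wal_buffers */

typedef uint64_t XLogRecPtr; /* byte position in the WAL stream */

typedef struct
{
    XLogRecPtr page_lsn[N_WAL_BUFFERS];        /* start LSN of page in each slot */
    char       pages[N_WAL_BUFFERS][WAL_PAGE_SIZE];
} wal_buffers_t;

/* Map an LSN to its ring slot: consecutive WAL pages cycle through slots. */
static int
lsn_to_slot(XLogRecPtr lsn)
{
    return (int) ((lsn / WAL_PAGE_SIZE) % N_WAL_BUFFERS);
}

/* Try to copy 'count' bytes starting at LSN 'start' from the ring into
 * 'buf'.  Returns true on success; false means some needed page has been
 * recycled by a newer page, so the caller must read from disk instead. */
static bool
walbuf_read(const wal_buffers_t *wb, XLogRecPtr start, size_t count, char *buf)
{
    size_t done = 0;

    while (done < count)
    {
        XLogRecPtr cur = start + done;
        XLogRecPtr page_start = cur - (cur % WAL_PAGE_SIZE);
        int        slot = lsn_to_slot(cur);
        size_t     off, n;

        /* Verify the slot still holds the page we want; a mismatch means
         * the ring has wrapped past this LSN. */
        if (wb->page_lsn[slot] != page_start)
            return false;

        off = (size_t) (cur % WAL_PAGE_SIZE);
        n = WAL_PAGE_SIZE - off;
        if (n > count - done)
            n = count - done;
        memcpy(buf + done, wb->pages[slot] + off, n);
        done += n;
    }
    return true;
}
```

The real implementation would additionally need to take the buffer locking into account and re-check the page identity after the copy, since a writer may overwrite a slot while a walsender is reading it; the sketch only shows the fallback-to-disk decision.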
[1] https://www.postgresql.org/message-id/20220309020123.sneaoijlg3rszvst%40alap3.anarazel.de
[2] https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/access/transam/xlogreader.c;h=f17e80948d17ff0e2e92fd1677d1a0da06778fc7;hb=7fed801135bae14d63b11ee4a10f6083767046d8#l1457
[3] Test setup:
shared_buffers = 8GB
max_wal_size = 32GB
checkpoint_timeout = 15min
track_wal_io_timing = on
wal_buffers = 16MB (auto-tuned value, not manually set)
Ubuntu VM: c5.4xlarge AWS EC2 instance, RAM: 32GB, vCores: 16, SSD: 512GB

./pgbench --initialize --scale=300 postgres
./pgbench --jobs=16 --progress=300 --client=32 --time=1800 --username=ubuntu postgres

-[ RECORD 1 ]----+---------------
application_name | async_standby1
wal_read         | 1685714
wal_read_bytes   | 17726209880
wal_read_time    | 7746.622
-[ RECORD 2 ]----+---------------
application_name | sync_standby1
wal_read         | 1685771
wal_read_bytes   | 17726209880
wal_read_time    | 6002.679

--
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/