One of our customers ran into a very odd case where hot standby feedback (backend_xmin propagation) stopped working after a major (hours/days) clock shift on hypervisor-managed VMs. It is fully reproducible, e.g. when the standby connects while its own VM clock is in the future (relative to the primary) and the clock is then set back to "normal". After that, the standby stops sending the "sending hot standby feedback xmin" messages, e.g.:

2024-12-05 02:02:35 UTC [6002]: db=,user=,app=,client= DEBUG: sending hot standby feedback xmin 1614031 epoch 0 catalog_xmin 0 catalog_xmin_epoch 0
2024-12-05 02:02:45 UTC [6002]: db=,user=,app=,client= DEBUG: sending write 6/E9015230 flush 6/E9015230 apply 6/E9015230
2024-12-05 02:02:45 UTC [6002]: db=,user=,app=,client= DEBUG: sending hot standby feedback xmin 1614031 epoch 0 catalog_xmin 0 catalog_xmin_epoch 0
<-- clock readjustment and no further "sending hot standby feedback"
2024-12-04 14:18:54 UTC [6002]: db=,user=,app=,client= DEBUG: sendtime 2024-12-04 14:18:51.836936+00 receipttime 2024-12-04 14:18:54.199223+00 replication apply delay 0 ms transfer latency 2363 ms
2024-12-04 14:18:54 UTC [6002]: db=,user=,app=,client= DEBUG: sending write 6/E9015258 flush 6/E9015230 apply 6/E9015230
2024-12-04 14:18:54 UTC [6002]: db=,user=,app=,client= DEBUG: sending write 6/E9015258 flush 6/E9015258 apply 6/E9015258
2024-12-04 14:18:54 UTC [6002]: db=,user=,app=,client= DEBUG: sending write 6/E9015258 flush 6/E9015258 apply 6/E9015258
2024-12-04 14:18:55 UTC [6002]: db=,user=,app=,client= DEBUG: sendtime 2024-12-04 14:18:53.136738+00 receipttime 2024-12-04 14:18:55.498946+00 replication apply delay 0 ms transfer latency 2363 ms
2024-12-04 14:18:55 UTC [6002]: db=,user=,app=,client= DEBUG: sending write 6/E9015280 flush 6/E9015258 apply 6/E9015258
2024-12-04 14:18:55 UTC [6002]: db=,user=,app=,client= DEBUG: sending write 6/E9015280 flush 6/E9015280 apply 6/E9015280

I can share reproduction steps if anyone is interested. This basically happens due to the use of TimestampDifferenceExceeds() in XLogWalRcvSendHSFeedback(), but I bet there are other similar scenarios.

What surprised me was that there is no recommendation to keep the primary and standby clocks synchronized when using hot_standby_feedback, although such a recommendation is given for recovery_min_apply_delay.
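
To illustrate the mechanism, here is a minimal standalone sketch of an elapsed-time throttle of the kind described above. It is not the walreceiver source: the names timestamp_difference_exceeds(), FeedbackThrottle and maybe_send_feedback() are made up for this example, and the assumption is that the real check compares the time of the last feedback message with the current wall-clock time. Once the clock steps back, the difference becomes negative and the check stays false until the clock catches up with the stored "future" timestamp.

```c
#include <stdbool.h>
#include <stdint.h>

/* Microseconds since some epoch (assumed representation for this sketch). */
typedef int64_t TimestampTz;

/* True once at least msec milliseconds separate start_time and stop_time.
 * A stop_time earlier than start_time gives a negative difference,
 * so the result is false. */
static bool
timestamp_difference_exceeds(TimestampTz start_time, TimestampTz stop_time,
                             int msec)
{
    TimestampTz diff = stop_time - start_time;

    return diff >= (TimestampTz) msec * 1000;
}

/* Hypothetical throttle state: wall-clock time of the last feedback sent. */
typedef struct FeedbackThrottle
{
    TimestampTz last_send;
} FeedbackThrottle;

/* Returns true if a feedback message would be sent at wall-clock time now. */
static bool
maybe_send_feedback(FeedbackThrottle *t, TimestampTz now, int interval_ms)
{
    if (!timestamp_difference_exceeds(t->last_send, now, interval_ms))
        return false;       /* interval not elapsed, or clock went backwards */

    t->last_send = now;     /* remember when feedback was last sent */
    return true;
}
```

With a 10 s interval, a last_send taken while the clock was 12 hours ahead keeps maybe_send_feedback() returning false for roughly 12 hours after the clock is corrected, which matches the silence seen in the log. A monotonic clock source, or resetting last_send when now is earlier than last_send, would avoid this, but that is a design suggestion and not what the attached patch does.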

So I would like to add at least one sentence to the hot_standby_feedback documentation to warn about this too; patch attached.

-J.

v1-0001-doc-Mention-clock-synchronization-recommendation-.patch