Hi,

On 2021-05-04 15:47:41 -0400, Tom Lane wrote:
> BTW, that conclusion shouldn't distract us from the very real bug
> that Andres identified. I was just scraping the buildfarm logs
> concerning recent failures, and I found several recent cases
> that match the symptom he reported:
> [...]
> They all show the standby in recovery/019_replslot_limit.pl failing
> with symptoms like
>
> 2021-05-04 07:42:00.968 UTC [24707406:1] LOG: database system was shut down in recovery at 2021-05-04 07:41:39 UTC
> 2021-05-04 07:42:00.968 UTC [24707406:2] LOG: entering standby mode
> 2021-05-04 07:42:01.050 UTC [24707406:3] LOG: redo starts at 0/1C000D8
> 2021-05-04 07:42:01.079 UTC [24707406:4] LOG: consistent recovery state reached at 0/1D00000
> 2021-05-04 07:42:01.079 UTC [24707406:5] FATAL: invalid memory alloc request size 1476397045
> 2021-05-04 07:42:01.080 UTC [13238274:3] LOG: database system is ready to accept read only connections
> 2021-05-04 07:42:01.082 UTC [13238274:4] LOG: startup process (PID 24707406) exited with exit code 1
Yea, that's the pre-existing end-of-log issue that got more likely as well
as more consequential (by accident) in Thomas' patch. It's easy to reach
parity with the state in 13, it's just a matter of changing the order in
one place. But I think we need to do something for all branches here.

The bandaid that was added to allocate_recordbuf() doesn't really seem
sufficient to me. That bandaid is

commit 70b4f82a4b5cab5fc12ff876235835053e407155
Author: Michael Paquier <mich...@paquier.xyz>
Date:   2018-06-18 10:43:27 +0900

    Prevent hard failures of standbys caused by recycled WAL segments

In <= 13 the current state is that, whenever we reach the end of the WAL
with the record header split across a page boundary, we'll allocate based
on an effectively random length, as long as that random length is below
1GB (because there we don't validate the header before allocating). That
allocation is then not freed for the lifetime of the xlogreader. And for
FRONTEND uses of xlogreader we'll just happily allocate up to 4GB.

The specific problem here is that we don't validate the record header
before allocating when the record header is split across a page boundary -
without much need, as far as I can tell: until we've read the entire
header, we don't actually need to allocate the record buffer. A rough
sketch of what I mean is at the end of this mail.

This seems like something that needs to be fixed to make us more robust in
crash recovery scenarios, where we can obviously end up with half-written
records.

But the issue that 70b4f82a4b is trying to address seems bigger to me. The
reason it's so easy to hit is that walreceiver does < 8KB writes into
recycled WAL segments *without* zero-filling the tail end of the page -
which will commonly contain random older contents, because we're reusing a
recycled segment. I think that *drastically* increases the likelihood of
finding something that looks like a valid record header, compared to the
situation on a primary, where zeroing pages before use makes that pretty
unlikely.

> (BTW, the behavior seen here where the failure occurs *immediately*
> after reporting "consistent recovery state reached" is seen in the
> other reports as well, including Andres' version. I wonder if that
> means anything.)

That's to be expected, I think. There's not a lot of data that needs to be
replayed, and we'll always reach consistency before the end of the WAL,
unless you're starting from an in-progress base backup that hasn't
finished yet, or such. The test causes replication to fail shortly after
that, so we'll always switch to doing recovery from pg_wal, which then
will hit the end of the WAL, hitting this issue with, I think, ~25%
likelihood (data in recycled WAL segments is probably *roughly* evenly
distributed, and any 4-byte value above 1GB will hit this error in 14).

Greetings,

Andres Freund
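
To make the deferred-allocation idea concrete, here's a minimal,
self-contained sketch. It is *not* the actual xlogreader code - all names
below are made up for illustration. The point is just that the fixed-size
header can be reassembled across a page boundary into a stack buffer and
its claimed length checked *before* sizing the record buffer:

/*
 * Hypothetical sketch - NOT the real xlogreader code.  Accumulate the
 * fixed-size record header across a page boundary into a small stack
 * buffer, validate the claimed length, and only then allocate the full
 * record buffer.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_MAX_RECORD_LEN (1024 * 1024 * 1024)    /* cap, a la MaxAllocSize */

/* Simplified stand-in for the fixed part of a WAL record header */
typedef struct DemoRecordHeader
{
    uint32_t    tot_len;        /* total record length, including header */
    uint8_t     rmid;           /* resource manager id */
    uint8_t     info;           /* flag bits */
} DemoRecordHeader;

/*
 * Page-at-a-time read callback, standing in for the xlogreader read_page
 * callback: hands back a pointer to page 'pageno' and its usable length.
 */
typedef bool (*demo_read_page_cb) (void *private_data, int pageno,
                                   const uint8_t **page, size_t *pagelen);

/*
 * Read one record starting at 'offset' on page 'pageno'.  Even if the
 * header straddles a page boundary we copy it into a fixed-size stack
 * buffer and check tot_len *before* sizing the record buffer, so garbage
 * at the end of the WAL cannot drive a bogus, huge allocation.
 */
uint8_t *
demo_read_record(demo_read_page_cb read_page, void *private_data,
                 int pageno, size_t offset, size_t *record_len)
{
    uint8_t     hdrbuf[sizeof(DemoRecordHeader)];
    size_t      have = 0;
    DemoRecordHeader hdr;
    uint8_t    *record;

    /* Accumulate the fixed-size header, crossing pages as needed. */
    while (have < sizeof(hdrbuf))
    {
        const uint8_t *page;
        size_t      pagelen;
        size_t      chunk;

        if (!read_page(private_data, pageno, &page, &pagelen) ||
            offset >= pagelen)
            return NULL;        /* ran out of WAL before a full header */

        chunk = pagelen - offset;
        if (chunk > sizeof(hdrbuf) - have)
            chunk = sizeof(hdrbuf) - have;
        memcpy(hdrbuf + have, page + offset, chunk);
        have += chunk;
        offset += chunk;
        if (offset >= pagelen)
        {
            pageno++;           /* header continues on the next page */
            offset = 0;
        }
    }

    memcpy(&hdr, hdrbuf, sizeof(hdr));

    /* Validate before allocating: reject obvious end-of-WAL garbage. */
    if (hdr.tot_len < sizeof(DemoRecordHeader) ||
        hdr.tot_len > DEMO_MAX_RECORD_LEN)
    {
        fprintf(stderr, "invalid record length %u, treating as end of WAL\n",
                (unsigned) hdr.tot_len);
        return NULL;
    }

    /* Only now is it safe to size the record buffer from tot_len. */
    record = malloc(hdr.tot_len);
    if (record == NULL)
        return NULL;
    memcpy(record, hdrbuf, sizeof(hdrbuf));

    /*
     * The remaining hdr.tot_len - sizeof(hdrbuf) bytes would be copied
     * from the following pages and CRC-checked here.
     */
    *record_len = hdr.tot_len;
    return record;
}

In the real code that'd presumably just amount to deferring the
allocate_recordbuf() call until the reassembled header has passed
ValidXLogRecordHeader().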