Hi,

On 2024-12-18 10:38:19 -0600, Nathan Bossart wrote:
> On Tue, Dec 17, 2024 at 04:50:16PM -0800, Robert Pang wrote:
> > We recently observed a few cases where Postgres running on Linux
> > encountered an issue with WAL segment files. Specifically, two WAL
> > segments were linked to the same physical file after Postgres ran out
> > of memory and the OOM killer terminated one of its processes. This
> > resulted in the WAL segments overwriting each other and Postgres
> > failing a later recovery.
>
> Yikes!
Indeed. As chance would have it, I was asked for input on a corrupted
server *today*. Eventually we found that recovery stopped early, after
encountering a segment with a *newer* pageaddr than we expected. That
made me think of this issue, and indeed, the file that recovery stopped
at had two links. Before that the server had been crashing on a regular
basis for unrelated reasons, which presumably increased the chances
sufficiently to eventually hit this problem.

It's normal to discover the end of the WAL by finding a segment with an
older pageaddr than its name suggests. But in this case we saw a newer
page address. I wonder if we should treat that differently...

> > We found this fix [1] that has been applied to Postgres 16, but the
> > cases we observed were running Postgres 15. Given that older major
> > versions will be supported for a good number of years, and the
> > potential for irrecoverability exists (even if rare), we would like to
> > discuss the possibility of back-patching this fix.
>
> IMHO this is a good time to reevaluate. It looks like we originally
> didn't back-patch out of an abundance of caution, but now that this one
> has had time to bake, I think it's worth seriously considering,
> especially now that we have a report from the field.

Strongly agreed.

I don't think the issue is actually as unlikely to be hit as the commit
message reasons. The crash does indeed have to happen between the link()
and the unlink() - but at the end of a checkpoint we do that operation
hundreds of times in a row on a busy server. And that's just after
potentially doing lots of write IO during the checkpoint, filling up
drive write caches / eating up IOPS/bandwidth disk quotas.

Greetings,

Andres Freund
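
PS: For readers not familiar with the code in question, here is a
minimal, self-contained sketch of the two patterns. The function names
are hypothetical (the real code lives in InstallXLogFileSegment(), which
used durable_rename_excl() before the fix and durable_rename() after),
and the fsync calls of the real functions are omitted - this illustrates
the crash window, it is not the actual source:

    #include <stdio.h>
    #include <unistd.h>

    /* Pre-16 pattern: install a recycled segment via link() + unlink(). */
    static int
    install_segment_unsafe(const char *tmppath, const char *path)
    {
        if (link(tmppath, path) < 0)    /* both names now exist */
            return -1;

        /*
         * A crash (e.g. OOM kill) here leaves tmppath and path as two
         * hard links to the same inode; once both names are in use as
         * WAL segments, they silently overwrite each other.
         */
        return unlink(tmppath);         /* old name removed only now */
    }

    /*
     * PG16 pattern: rename() atomically replaces the destination, so
     * there is no window in which two names point at the same file.
     */
    static int
    install_segment_safe(const char *tmppath, const char *path)
    {
        return rename(tmppath, path);
    }

    int
    main(void)
    {
        FILE       *f;

        /* purely illustrative paths */
        f = fopen("xlogtemp.1", "w");
        if (f == NULL)
            return 1;
        fclose(f);
        if (install_segment_unsafe("xlogtemp.1",
                                   "000000010000000000000001") < 0)
            return 1;

        f = fopen("xlogtemp.2", "w");
        if (f == NULL)
            return 1;
        fclose(f);
        return install_segment_safe("xlogtemp.2",
                                    "000000010000000000000002");
    }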