On Tue, Sep 26, 2023 at 8:38 PM Michael Paquier <mich...@paquier.xyz> wrote:
> Thoughts and/or comments are welcome.
I don't have an opinion yet on your other thread about making this stuff configurable for replicas, but for the simple crash recovery case shown here, hard failure makes sense to me. Here are some interesting points in the history of this topic:

1999 30659d43: xl_len is 16 bit, fixed size buffer in later commits
2001 7d4d5c00: WAL files recycled, xlp_pageaddr added
2004 0ffe11ab: xl_len is 32 bit, dynamic buffer, malloc() failure ends redo
2005 21fda22e: xl_tot_len and xl_len co-exist
2014 2c03216d: xl_tot_len fully replaces xl_len
2018 70b4f82a: xl_tot_len > 1GB ends redo
2023 8fcb32db: don't let xl_tot_len > 1GB be logged!
2023 bae868ca: check next xlp_pageaddr, xlp_rem_len before allocating

Recycled pages can't fool us into making a huge allocation any more. If xl_tot_len implies more than one page but the next page's xlp_pageaddr is too low, then either the xl_tot_len you read was recycled garbage bits, or it was legitimate but the overwrite of the following page didn't make it to disk; either way, we don't have a record, so we have an end-of-WAL condition. The xlp_rem_len check defends against the second page making it to disk while the first one still contains recycled garbage where the xl_tot_len should be*.

What Michael wants to do now is remove the 2004-era assumption that malloc failure implies bogus data. Such a failure must be pretty unlikely in a 64 bit world with overcommitted virtual memory, but a legitimate xl_tot_len can falsely end recovery and lose data, as reported from a production case analysed by his colleagues. In other words, we can actually distinguish between lack of resources and recycled bogus data, so why treat them the same? For comparison, if you run out of disk space during recovery we don't say "oh well, that's enough redoing for today, the computer is full, let's forget about the rest of the WAL and start accepting new transactions!".
The machine running recovery has certain resource requirements relative to the machine that generated the WAL, and if they're not met it just can't do it. It's the same if various other allocations fail. The new situation is that we are now verifying that xl_tot_len was actually logged by PostgreSQL, so if we can't allocate space for it, we can't replay the WAL.

*A more detailed analysis would talk about sectors (the page header is atomic), and consider whether we're only trying to defend ourselves against recycled pages written by PostgreSQL (yes), arbitrary random data (no, but it's probably still pretty good) or someone trying to trick us (no, and we don't stand a chance).