At Thu, 10 Aug 2023 10:00:17 +0900 (JST), Kyotaro Horiguchi <horikyota....@gmail.com> wrote in
> At Wed, 9 Aug 2023 17:44:49 +0900, Michael Paquier <mich...@paquier.xyz> wrote in
> > > While it's a kind of bug in total, we encountered a case where an
> > > excessively large xl_tot_len actually came from a corrupted
> > > record. [1]
> >
> > Right, I remember this one.  I think that Thomas was pretty much right
> > that this could be caused by a lack of zeroing in the WAL pages.
>
> We have treated every kind of broken data as end-of-recovery, like an
> incorrect rm_id or prev link, including an excessively large record
> length due to corruption. This patch is going to change the behavior
> only for the last one. If you think there can't be non-zero broken
> data, we should stop recovery from proceeding after any non-zero
> incorrect data. This seems to be quite a big change in our recovery
> policy.
>
> > There are a few options on the table, only doable once the WAL reader
> > provides the error state to the startup process:
> > 1) Retry a few times and FATAL.
> > 2) Just FATAL immediately and don't wait.
> > 3) Retry and hope for the best that the host calms down.
> > 4) Wrap up recovery then continue to normal operation.
>
> This is the traditional behavior for corrupt WAL data.
>
> > I have not seen this issue being much of a problem in the field, so
> > perhaps option 2 with the structure of 0002 and a FATAL when we catch
> > XLOG_READER_OOM in the switch would be enough.  At least that's enough
> > for the cases we've seen.  I'll think a bit more about it, as well.
> >
> > Yeah, agreed.  That's orthogonal to the issue reported by Ethan,
> > unfortunately, where he was able to trigger the issue of this thread
> > by manipulating the sizing of a host after producing a record larger
> > than what the host could afford after the resizing :/
>
> I'm not entirely certain, but if you were to ask me which is more
> probable during recovery - encountering a correct record that's too
> lengthy for the server to buffer or stumbling upon a corrupt byte
> sequence - I'd bet on the latter.

... of course this refers to crash recovery. For replication, we should
keep retrying the current record until the operator commands promotion.
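
To make that distinction concrete, here is a rough standalone sketch of
the behavior I have in mind. It is not PostgreSQL code and does not use
the interfaces from 0002; the outcome codes, the retry limit, and the
function names are all invented for illustration.

/*
 * Standalone illustration only; this is not PostgreSQL code and does
 * not use the interfaces from 0002.  The outcome codes, retry limit,
 * and function names are invented.  It only models the policy above:
 * in crash recovery an allocation failure is retried a few times and
 * then becomes fatal, broken record data ends recovery as before, and
 * a standby keeps retrying the read until promotion is requested.
 */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

typedef enum
{
	READ_OK,				/* record decoded successfully */
	READ_OOM,				/* could not allocate the record buffer */
	READ_BROKEN,			/* bad xl_tot_len, rm_id, prev link, ... */
	READ_END_OF_WAL			/* no more records to read */
} ReadOutcome;

/* a canned outcome sequence standing in for the real WAL reader */
static const ReadOutcome script[] = {READ_OK, READ_OOM, READ_OK, READ_BROKEN};
static int	script_pos = 0;

static ReadOutcome
try_read_next_record(void)
{
	if (script_pos >= (int) (sizeof(script) / sizeof(script[0])))
		return READ_END_OF_WAL;
	return script[script_pos++];
}

/* in the real server this would check the promotion trigger */
static bool
promotion_requested(void)
{
	return false;
}

#define MAX_OOM_RETRIES 3

static void
replay_loop(bool standby_mode)
{
	int			oom_retries = 0;

	for (;;)
	{
		switch (try_read_next_record())
		{
			case READ_OK:
				oom_retries = 0;
				printf("applied one record\n");
				break;

			case READ_OOM:
				if (standby_mode)
				{
					/* retry the read until promotion is requested */
					if (promotion_requested())
						return;
					break;
				}
				if (++oom_retries <= MAX_OOM_RETRIES)
					break;		/* transient pressure may go away */
				fprintf(stderr, "FATAL: out of memory reading WAL\n");
				exit(1);

			case READ_BROKEN:
			case READ_END_OF_WAL:
				/* traditional behavior: treat as end of recovery */
				printf("ending recovery\n");
				return;
		}
	}
}

int
main(void)
{
	replay_loop(false);			/* crash recovery; pass true for a standby */
	return 0;
}

The sketch only exercises crash recovery; with standby_mode the
READ_OOM case never gives up, it just checks for promotion and reads
again.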

> I'm not sure how often users encounter corrupt WAL data, but I believe
> they should have the option to terminate recovery and then switch to
> normal operation.
>
> What if we introduced an option to increase the timeline whenever
> recovery hits a data error? If that option is disabled, the server
> stops when recovery detects incorrect data, except in the case of an
> OOM, which causes the record to be retried.
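
To be a bit more concrete about that option, the following standalone
sketch shows the decision I have in mind. Again, it is not PostgreSQL
code; the continue_on_data_error flag, the error codes, and the
function names are all made up for illustration.

/*
 * Standalone sketch only; not PostgreSQL code.  The flag, error codes,
 * and function names are invented to illustrate the proposed option:
 * a data error either ends recovery on a new timeline or stops the
 * server, while an OOM only leads to retrying the record.
 */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

typedef enum { RECOVERY_ERR_DATA, RECOVERY_ERR_OOM } RecoveryError;

/* hypothetical knob corresponding to the proposed option */
static bool continue_on_data_error = true;

static void
handle_recovery_error(RecoveryError err, int *timeline)
{
	switch (err)
	{
		case RECOVERY_ERR_OOM:
			printf("retrying the current record\n");
			break;

		case RECOVERY_ERR_DATA:
			if (continue_on_data_error)
			{
				(*timeline)++;
				printf("ending recovery, switching to timeline %d\n",
					   *timeline);
			}
			else
			{
				fprintf(stderr, "FATAL: incorrect data found in WAL\n");
				exit(1);
			}
			break;
	}
}

int
main(void)
{
	int			timeline = 1;

	handle_recovery_error(RECOVERY_ERR_OOM, &timeline);
	handle_recovery_error(RECOVERY_ERR_DATA, &timeline);
	return 0;
}

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center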