On 2021-Oct-13, Robert Haas wrote: > On Wed, Oct 13, 2021 at 2:01 AM Amul Sul <sula...@gmail.com> wrote: > > Instead of abortedRecPtr point, isn't enough to write > > overwrite-contrecord at XLogCtl->lastReplayedEndRecPtr? I think both > > are pointing to the same location then can't we use > > lastReplayedEndRecPtr instead of abortedRecPtr to write > > overwrite-contrecord and remove need of extra global variable, like > > attached? > > I think you mean missingContrecPtr, not abortedRecPtr. If I understand > correctly, abortedRecPtr is going to be the location in some WAL > segment which we replayed where a long record began, but > missingContrecPtr seems like it would have to point to the beginning > of the first segment we were unable to find to continue replay; and > thus it ought to be the same as lastReplayedEndRecPtr.
So abortedRecPtr and missingContrecPtr both point to the same long record: the former is the start of the record, and the latter is some intermediate position where we failed to find the contrecord. lastReplayedEndRecPtr is the end+1 of the record *prior* to the long record. > But the committed code doesn't seem to check that these are the same > or verify the relationship between them in any way, so I'm worried > there is some other case here. Yeah, the only reason they are the same is that xlogreader sets both to Invalid when reading a record, and then sets both when a read fails. > The comments in XLogReadRecord also suggest this: > > * We get here when a record that spans multiple pages needs to be > * assembled, but something went wrong -- perhaps a contrecord piece > * was lost. If caller is WAL replay, it will know where the aborted > > Saying that "perhaps" a contrecord piece was lost seems to imply that > other explanations are possible as well, but I'm not sure what. Other explanations are possible. Imagine cosmic rays alter one byte in the last contrecord. WAL replay will stop there, and the contrecord will have been found all right, but CRC check would have failed to pass, so we would set xlogreader->missingContrecPtr to the final contrecord of that record (I didn't actually verify this particular scenario.) In fact, anything that causes xlogreader.c:XLogReadRecord to return NULL after setting "assembled=true" would set both abortedRecPtr and missingContrecPtr -- except DecodeXLogRecord failure, which perhaps should be handled in the same way. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Siempre hay que alimentar a los dioses, aunque la tierra esté seca" (Orual)