On Wed, Nov 13, 2024 at 5:25 PM Tomas Vondra <to...@vondra.me> wrote: > > On 11/13/24 05:38, Ashutosh Bapat wrote: > > ... > > > > Here's way we can fix SnapBuildProcessRunningXacts() similar to > > DecodeCommit(). DecodeCommit() uses SnapBuildXactNeedsSkip() to decide > > whether a given transaction should be decoded or not. > > /* > > * Should the contents of transaction ending at 'ptr' be decoded? > > */ > > bool > > SnapBuildXactNeedsSkip(SnapBuild *builder, XLogRecPtr ptr) > > { > > return ptr < builder->start_decoding_at; > > } > > > > Similar to SnapBuild::start_decoding_at we could maintain a field > > SnapBuild::start_reading_at to the LSN from which the WAL sender would > > start reading WAL. If candidate_restart_lsn produced by a running > > transactions WAL record is less than SnapBuild::start_reading_at, > > SnapBuildProcessRunningXacts() won't call > > LogicalIncreaseRestartDecodingForSlot() with that candiate LSN. We > > won't access the slot here and the solution will be inline with > > DecodeCommit() which skips the transactions. > > > > Could you maybe write a patch doing this? That would allow proper > testing etc.
Here's a quick and dirty patch which describes the idea. I didn't get time to implement code to move SnapBuild::restart_lsn if SnapBuild::start_decoding_at moves forward while building initial snapshot. I am not sure whether that's necessary either. I have added three elogs to see if the logic is working as expected. I see two of the elogs in patch in the server log when I run tests from tests/subscription and tests/recovery. But I do not see the third one. That either means that the situation causing the bug is not covered by those tests or the fix is not triggered. If you run your reproduction and still see the crashes please provide the output of those elog messages along with the rest of the elogs you have added. -- Best Wishes, Ashutosh Bapat