Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly

Amit Kapila Sun, 25 May 2025 23:55:55 -0700

On Sat, May 24, 2025 at 6:59 PM Alexander Korotkov <aekorot...@gmail.com> wrote:
>
> Hi, Amit!
>
> Thank you for your attention to this patchset!
>
> On Sat, May 24, 2025 at 2:15 PM Amit Kapila <amit.kapil...@gmail.com> wrote:
> > On Sat, May 24, 2025 at 4:08 PM Alexander Korotkov <aekorot...@gmail.com> 
> > wrote:
> > >
> > > I spend more time on this.  The next revision is attached.  It
> > > contains revised comments and other cosmetic changes.  I'm going to
> > > backpatch 0001 to all supported branches,
> > >
> >
> > Is my understanding correct that we need 0001 because
> > PhysicalConfirmReceivedLocation() doesn't save the slot to disk after
> > changing the slot's restart_lsn?
>
> Yes.  Also, even if it would save slot to the disk, there is still
> race condition that concurrent checkpoint could use updated value from
> the shared memory to clean old WAL segments, and then crash happens
> before we managed to write the slot to the disk.
>


How can that happen, if we first write the updated value to disk and
then update the shared memory as we do in
LogicalConfirmReceivedLocation?

BTW, won't there be a similar problem with physical slot's xmin
computation as well? In PhysicalReplicationSlotNewXmin(), after
updating the slot's xmin computation, we mark it as dirty and update
shared memory values. Now, say after checkpointer writes these xmin
values to disk, walsender receives another feedback message and
updates the slot's xmin values. Now using these updated shared memory
values, vacuum removes the rows, however, a restart will show the
older xmin values in the slot, which mean vacuum would have removed
the required rows before restart.

As per my understanding, neither the xmin nor the LSN problem exists
for logical slots. I am pointing this out to indicate we may need to
think of a different solution for physical slots, if these are
problems only for physical slots.

-- 
With Regards,
Amit Kapila.

Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly

Reply via email to