Re: Impact of checkpointer during pg_upgrade

Dilip Kumar Thu, 07 Sep 2023 03:13:11 -0700

On Thu, Sep 7, 2023 at 3:34 PM Amit Kapila <[email protected]> wrote:
>
> On Thu, Sep 7, 2023 at 12:55 PM Michael Paquier <[email protected]> wrote:
> >
> > (Just dumping what I have in mind while reading the thread.)
> >
> > On Sat, Sep 02, 2023 at 10:08:51AM +0530, Amit Kapila wrote:
> > > During pg_upgrade, we start the server for the old cluster which can
> > > allow the checkpointer to remove the WAL files. It has been noticed
> > > that we do generate certain types of WAL records (e.g
> > > XLOG_RUNNING_XACTS, XLOG_CHECKPOINT_ONLINE, and XLOG_FPI_FOR_HINT)
> > > even during pg_upgrade for old cluster, so additional WAL records
> > > could let checkpointer decide that certain WAL segments can be removed
> > > (e.g. say wal size crosses max_slot_wal_keep_size_mb) and invalidate
> > > the slots. Currently, I can't see any problem with this but for future
> > > work where we want to migrate logical slots during an upgrade[1], we
> > > need to decide what to do for such cases. The initial idea we had was
> > > that if the old cluster has some invalid slots, we won't allow an
> > > upgrade unless the user removes such slots or uses some option like
> > > --exclude-slots. It is quite possible that slots got invalidated
> > > during pg_upgrade due to no user activity. Now, even though the
> > > possibility of the same is less I think it is worth considering what
> > > should be the behavior.
> > >
> > > The other possibilities apart from not allowing an upgrade in such a
> > > case could be (a) Before starting the old cluster, we fetch the slots
> > > directly from the disk using some tool like [2] and make the decisions
> > > based on that state; (b) During the upgrade, we don't allow WAL to be
> > > removed if it can invalidate slots; (c) Copy/Migrate the invalid slots
> > > as well but for that, we need to expose an API to invalidate the
> > > slots; (d) somehow distinguish the slots that are invalidated during
> > > an upgrade and then simply copy such slots because anyway we ensure
> > > that all the WAL required by slot is sent before shutdown.
> >
> > Checking for any invalid slots at the beginning of the upgrade and
> > complain sounds like a good thing to do, because these are not
> > expected to begin with, no?  That's similar to a pre-check requirement
> > that the slots should have fed the subscribers with all the data
> > available up to the shutdown checkpoint when the publisher was stopped
> > before running pg_upgrade.  So (a) is a good idea to prevent an
> > upgrade based on a state we don't expect from the start, as something
> > in check.c, I assume.
> >
> > On top of that, (b) sounds like a good idea to me anyway to be more
> > defensive.  But couldn't we do that just like we do for autovacuum and
> > force the GUCs that could remove the slot's WAL to not do their work
> > instead?
> >
>
> I think if we just make max_slot_wal_keep_size to -1 that should be
> sufficient to not let any slots get invalidated during upgrade. Do you
> have anything else in mind?


This seems like a good solution to the problem.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: Impact of checkpointer during pg_upgrade

Reply via email to