On Thu, Nov 9, 2023 at 8:11 AM Amit Kapila <amit.kapil...@gmail.com> wrote:
>
> On Wed, Nov 8, 2023 at 8:09 PM Drouvot, Bertrand
> <bertranddrouvot...@gmail.com> wrote:
> >
> > > Unrelated to above, if there is a user slot on standby with the same
> > > name which the slot-sync worker is trying to create, then shall it
> > > emit a warning and skip the sync of that slot or shall it throw an
> > > error?
> > >
> >
> > I'd vote for emit a warning and move on to the next slot if any.
> >
>
> But then it could take time for users to know the actual problem and
> they probably notice it after failover. OTOH, if we throw an error
> then probably they will come to know earlier because the slot sync
> mechanism would be stopped. Do you have reasons to prefer giving a
> WARNING and skipping creating such slots? I expect this WARNING to
> keep getting repeated in LOGs because the consecutive sync tries will
> again generate a WARNING.
>
Apart from the above, I would like to discuss the slot-sync work
distribution strategy of this patch. The current implementation, as
explained in the commit message [1], works well if the slots belong to
multiple databases. The data in emails [2][3][4] make it clear that
having more workers really helps when the slots are spread across
multiple databases, but if all the slots belong to one or very few
databases then such a strategy won't be as good. So, on one hand, we
get very good numbers for a particular workload with the strategy used
in the patch, but OTOH it may not adapt well to various other kinds of
workloads. The question, then, is whether we should try to optimize
this strategy for various kinds of workloads, or, for the first
version, use a single slot-sync worker and enhance the functionality
in later patches, either in PG17 itself or in PG18 or later versions.
One thing to note is that a lot of the complexity of the patch is
attributable to the multi-worker strategy, which may still not be
efficient, so there is an argument for going with the simpler single
slot-sync worker strategy and enhancing it in future versions as we
learn more about various workloads. That would also help us develop
this feature incrementally instead of doing everything in one go and
taking much longer than it should. Thoughts?

[1] - "The replication launcher on the physical standby queries
primary to get the list of dbids for failover logical slots. Once it
gets the dbids, if dbids < max_slotsync_workers, it starts only that
many workers, and if dbids > max_slotsync_workers, it starts
max_slotsync_workers and divides the work equally among them. Each
worker is then responsible to keep on syncing the logical slots
belonging to the DBs assigned to it. Each slot-sync worker will have
its own dbids list. Since the upper limit of this dbid-count is not
known, it needs to be handled using dsa. We initially allocated memory
to hold 100 dbids for each worker. If this limit is exhausted, we
reallocate this memory with size incremented again by 100."
[2] - https://www.postgresql.org/message-id/CAJpy0uD2F43avuXy_yQv7Wa3kpUwioY_Xn955xdmd6vX0ME6%3Dg%40mail.gmail.com
[3] - https://www.postgresql.org/message-id/CAFPTHDZw2G3Pax0smymMjfPqdPcZhMWo36f9F%2BTwNTs0HFxK%2Bw%40mail.gmail.com
[4] - https://www.postgresql.org/message-id/CAJpy0uD%3DDevMxTwFVsk_%3DxHqYNH8heptwgW6AimQ9fbRmx4ioQ%40mail.gmail.com

--
With Regards,
Amit Kapila.
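PS: To make the division described in [1] a bit more concrete, below
is a rough, self-contained sketch of that kind of dbid distribution.
It is only an illustration, not the patch's actual code: the names
(SlotSyncWorkerSketch, assign_dbs_to_workers) are made up for this
example, and it uses plain malloc and a fixed worker array instead of
DSA with the 100-entry growable per-worker dbid list that the patch
describes.

#include <stdio.h>
#include <stdlib.h>

typedef unsigned int Oid;       /* stand-in for PostgreSQL's Oid */

typedef struct
{
    int  ndbs;                  /* number of dbids assigned to this worker */
    Oid *dbids;                 /* dbids this worker keeps syncing slots for */
} SlotSyncWorkerSketch;

/*
 * Spread n_dbids databases over at most max_workers workers: if there
 * are fewer dbids than max_workers, only that many workers get work;
 * otherwise every worker gets roughly n_dbids / max_workers databases.
 * Returns the number of workers actually used.
 */
static int
assign_dbs_to_workers(const Oid *dbids, int n_dbids,
                      SlotSyncWorkerSketch *workers, int max_workers)
{
    int nworkers = (n_dbids < max_workers) ? n_dbids : max_workers;

    for (int w = 0; w < nworkers; w++)
    {
        int cnt = 0;

        /* worker w takes every nworkers-th dbid, so the split is even */
        for (int i = w; i < n_dbids; i += nworkers)
            cnt++;

        workers[w].ndbs = cnt;
        workers[w].dbids = malloc(sizeof(Oid) * cnt);

        cnt = 0;
        for (int i = w; i < n_dbids; i += nworkers)
            workers[w].dbids[cnt++] = dbids[i];
    }

    return nworkers;
}

int
main(void)
{
    Oid dbids[] = {16384, 16385, 16386, 16387, 16388};
    SlotSyncWorkerSketch workers[2];    /* pretend max_slotsync_workers = 2 */
    int nworkers = assign_dbs_to_workers(dbids, 5, workers, 2);

    for (int w = 0; w < nworkers; w++)
    {
        printf("worker %d gets %d db(s):", w, workers[w].ndbs);
        for (int i = 0; i < workers[w].ndbs; i++)
            printf(" %u", workers[w].dbids[i]);
        printf("\n");
        free(workers[w].dbids);
    }
    return 0;
}

With five dbids and max_slotsync_workers = 2, this prints three dbids
for worker 0 and two for worker 1, which is the "divides the work
equally" behaviour the commit message describes; with a single
slot-sync worker all of this bookkeeping simply goes away.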