On Thu, Nov 9, 2023 at 8:11 AM Amit Kapila <amit.kapil...@gmail.com> wrote:
>
> On Wed, Nov 8, 2023 at 8:09 PM Drouvot, Bertrand
> <bertranddrouvot...@gmail.com> wrote:
> >
> > > Unrelated to above, if there is a user slot on standby with the same
> > > name which the slot-sync worker is trying to create, then shall it
> > > emit a warning and skip the sync of that slot or shall it throw an
> > > error?
> > >
> >
> > I'd vote for emit a warning and move on to the next slot if any.
> >
>
> But then it could take time for users to know the actual problem and
> they probably notice it after failover. OTOH, if we throw an error
> then probably they will come to know earlier because the slot sync
> mechanism would be stopped. Do you have reasons to prefer giving a
> WARNING and skipping creating such slots? I expect this WARNING to
> keep getting repeated in LOGs because the consecutive sync tries will
> again generate a WARNING.
>
Apart from the above, I would like to discuss the slot-sync work
distribution strategy of this patch. The current implementation, as
explained in the commit message [1], works well if the slots belong to
multiple databases. The data in emails [2][3][4] make it clear that
having more workers really helps when the slots are spread across
multiple databases, but if all the slots belong to one or very few
databases then such a strategy won't be as good. So, on one hand, we
get very good numbers for a particular workload with the strategy used
in the patch, but OTOH it may not adapt well to various other kinds of
workloads. The question, then, is whether we should try to optimize
this strategy for various kinds of workloads, or, for the first
version, use a single slot-sync worker and enhance the functionality
in later patches, either in PG17 itself or in PG18 or later versions.
One thing to note is that a lot of the complexity of the patch is
attributable to the multi-worker strategy, which may still not be
efficient, so there is an argument for going with the simpler single
slot-sync worker strategy and enhancing it in future versions as we
learn more about various workloads. That would also help us develop
this feature incrementally instead of doing everything in one go and
taking much longer than it should. Thoughts?

[1] - "The replication launcher on the physical standby queries
primary to get the list of dbids for failover logical slots. Once it
gets the dbids, if dbids < max_slotsync_workers, it starts only that
many workers, and if dbids > max_slotsync_workers, it starts
max_slotsync_workers and divides the work equally among them. Each
worker is then responsible to keep on syncing the logical slots
belonging to the DBs assigned to it. Each slot-sync worker will have
its own dbids list. Since the upper limit of this dbid-count is not
known, it needs to be handled using dsa. We initially allocated memory
to hold 100 dbids for each worker. If this limit is exhausted, we
reallocate this memory with size incremented again by 100."
[2] - https://www.postgresql.org/message-id/CAJpy0uD2F43avuXy_yQv7Wa3kpUwioY_Xn955xdmd6vX0ME6%3Dg%40mail.gmail.com
[3] - https://www.postgresql.org/message-id/CAFPTHDZw2G3Pax0smymMjfPqdPcZhMWo36f9F%2BTwNTs0HFxK%2Bw%40mail.gmail.com
[4] - https://www.postgresql.org/message-id/CAJpy0uD%3DDevMxTwFVsk_%3DxHqYNH8heptwgW6AimQ9fbRmx4ioQ%40mail.gmail.com

--
With Regards,
Amit Kapila.
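PS: To make the division described in [1] a bit more concrete, below
is a rough, self-contained sketch of that kind of dbid distribution.
It is only an illustration, not the patch's actual code: the names
(SlotSyncWorkerSketch, assign_dbs_to_workers) are made up for this
example, and it uses plain malloc and a fixed worker array instead of
DSA with the 100-entry growable per-worker dbid list that the patch
describes.

#include <stdio.h>
#include <stdlib.h>

typedef unsigned int Oid;       /* stand-in for PostgreSQL's Oid */

typedef struct
{
    int  ndbs;                  /* number of dbids assigned to this worker */
    Oid *dbids;                 /* dbids this worker keeps syncing slots for */
} SlotSyncWorkerSketch;

/*
 * Spread n_dbids databases over at most max_workers workers: if there
 * are fewer dbids than max_workers, only that many workers get work;
 * otherwise every worker gets roughly n_dbids / max_workers databases.
 * Returns the number of workers actually used.
 */
static int
assign_dbs_to_workers(const Oid *dbids, int n_dbids,
                      SlotSyncWorkerSketch *workers, int max_workers)
{
    int nworkers = (n_dbids < max_workers) ? n_dbids : max_workers;

    for (int w = 0; w < nworkers; w++)
    {
        int cnt = 0;

        /* worker w takes every nworkers-th dbid, so the split is even */
        for (int i = w; i < n_dbids; i += nworkers)
            cnt++;

        workers[w].ndbs = cnt;
        workers[w].dbids = malloc(sizeof(Oid) * cnt);

        cnt = 0;
        for (int i = w; i < n_dbids; i += nworkers)
            workers[w].dbids[cnt++] = dbids[i];
    }

    return nworkers;
}

int
main(void)
{
    Oid dbids[] = {16384, 16385, 16386, 16387, 16388};
    SlotSyncWorkerSketch workers[2];    /* pretend max_slotsync_workers = 2 */
    int nworkers = assign_dbs_to_workers(dbids, 5, workers, 2);

    for (int w = 0; w < nworkers; w++)
    {
        printf("worker %d gets %d db(s):", w, workers[w].ndbs);
        for (int i = 0; i < workers[w].ndbs; i++)
            printf(" %u", workers[w].dbids[i]);
        printf("\n");
        free(workers[w].dbids);
    }
    return 0;
}

With five dbids and max_slotsync_workers = 2, this prints three dbids
for worker 0 and two for worker 1, which is the "divides the work
equally" behaviour the commit message describes; with a single
slot-sync worker all of this bookkeeping simply goes away.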