Hi, and thanks for the patch! It is an interesting idea. I have not yet fully read this thread, so below are only my first impressions after looking at patch 0001. Sorry if some of these were already discussed earlier.
TBH, the "reuse-workers" logic of the patch seemed more complicated than I had imagined it might be.

1.
IIUC with patch 0001, each/every tablesync worker (a.k.a. TSW), when it finishes dealing with one table, then goes looking to find if there is some relation that it can process next. So now every TSW has a loop where it will fight with every other available TSW over who will get to process the next relation.

Somehow this seems all backwards to me. Isn't it strange for the TSW to be the one deciding what relation it will deal with next?

IMO it seems more natural to simply return the finished TSW to some kind of "pool" of available workers, and then the main Apply process can just grab a TSW from that pool instead of launching a brand new one in the existing function process_syncing_tables_for_apply(). Or, maybe those "available" workers could be returned to a pool maintained within the launcher.c code, which logicalrep_worker_launch() can draw from instead of launching a whole new process?

(I need to read the earlier posts to see if these options were already discussed and rejected)

~~

2.
AFAIK the thing that identifies a tablesync worker is the fact that only a TSW will have a 'relid'. But it feels very awkward to me to have a TSW marked as "available" and yet that LogicalRepWorker must still have some OLD relid field value lurking (otherwise it will forget that it is a "tablesync" worker!).

IMO perhaps it is time now to introduce some enum 'type' to the LogicalRepWorker. Then an "in_use" type=TSW would have a valid 'relid', whereas an "available" type=TSW would have relid == InvalidOid. (See the rough sketch at the end of this mail for the kind of thing I mean.)

~~

3.
Maybe I am mistaken, but it seems the benchmark results posted are only using quite small/default values for "max_sync_workers_per_subscription", so I wondered how those results are affected by increasing that GUC. I think having only very few workers would cause more sequential processing, so conveniently the effect of the patch avoiding re-launch might be seen in the best possible light. OTOH, using more TSWs in the first place might reduce the overall tablesync time because the subscriber can do more work in parallel.

So I'm not quite sure what the goal is here. E.g. if the user doesn't care much about how long the tablesync phase takes, then there is maybe no need for this patch at all. OTOH, I thought if a user does care about the subscription startup time, won't those users be opting for a much larger "max_sync_workers_per_subscription" in the first place? Therefore, shouldn't the benchmarking be using a larger number too?

======

Here are a few other random things noticed while looking at patch 0001:

1. Commit message

1a.
typo /sequantially/sequentially/

1b.
Saying "killed" and "killing" seemed a bit extreme and implies somebody else is killing the process. But I think mostly the tablesync worker just ends with a normal proc exit, so maybe reword this a bit.

~~~

2.
It seemed odd that some -- clearly tablesync-specific -- functions are in worker.c instead of in tablesync.c.

2a. e.g. clean_sync_worker

2b. e.g. sync_worker_exit

~~~

3. process_syncing_tables_for_sync

+ /*
+ * Sync worker is cleaned at this point. It's ready to sync next table,
+ * if needed.
+ */
+ SpinLockAcquire(&MyLogicalRepWorker->relmutex);
+ MyLogicalRepWorker->ready_to_reuse = true;
  SpinLockRelease(&MyLogicalRepWorker->relmutex);
+ }
+
+ SpinLockRelease(&MyLogicalRepWorker->relmutex);

Isn't there a double release of that mutex happening there?

------
Kind Regards,
Peter Smith.
Fujitsu Australia
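
PS. For what it's worth, below is a very rough sketch of the kind of 'type' field I had in mind for comment #2 above. All of the enum/field names here are just made up for illustration (they are not from patch 0001), and most of the existing LogicalRepWorker fields are omitted:

/* Hypothetical worker-type enum -- names are illustrative only. */
typedef enum LogicalRepWorkerType
{
	WORKERTYPE_APPLY,
	WORKERTYPE_TABLESYNC
} LogicalRepWorkerType;

typedef struct LogicalRepWorker
{
	/* What kind of worker this is; set at launch time and never changes. */
	LogicalRepWorkerType type;

	/*
	 * Relation currently being synchronized. An "available" (pooled)
	 * tablesync worker could then just have relid == InvalidOid, because
	 * the 'type' field already records that it is a TSW.
	 */
	Oid			relid;

	/* ... all the other existing fields stay as they are ... */
} LogicalRepWorker;

Then, instead of relying on OidIsValid(worker->relid) to decide whether a worker is a tablesync worker, code could check something like worker->type == WORKERTYPE_TABLESYNC, which keeps working even while relid is reset between tables.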