Hi, and thanks for the patch! It is an interesting idea. I have not yet fully read this thread, so below are only my first impressions after looking at patch 0001. Sorry if some of these were already discussed earlier.
TBH, the "reuse-workers" logic of the patch seemed more complicated than I had imagined it might be.

1.
IIUC with patch 0001, each/every tablesync worker (a.k.a. TSW), when it finishes dealing with one table, then goes looking to find if there is some relation that it can process next. So now every TSW has a loop where it will fight with every other available TSW over who will get to process the next relation.

Somehow this seems all backwards to me. Isn't it strange for the TSW to be the one deciding what relation it will deal with next?

IMO it seems more natural to simply return the finished TSW to some kind of "pool" of available workers, and then the main Apply process can just grab a TSW from that pool instead of launching a brand new one in the existing function process_syncing_tables_for_apply(). Or, maybe those "available" workers could be returned to a pool maintained within the launcher.c code, which logicalrep_worker_launch() can draw from instead of launching a whole new process?

(I need to read the earlier posts to see if these options were already discussed and rejected)

~~

2.
AFAIK the thing that identifies a tablesync worker is the fact that only a TSW will have a 'relid'. But it feels very awkward to me to have a TSW marked as "available" and yet that LogicalRepWorker must still have some OLD relid field value lurking (otherwise it will forget that it is a "tablesync" worker!).

IMO perhaps it is time now to introduce some enum 'type' to the LogicalRepWorker. Then an "in_use" type=TSW would have a valid 'relid', whereas an "available" type=TSW would have relid == InvalidOid. (See the rough sketch at the end of this mail for the kind of thing I mean.)

~~

3.
Maybe I am mistaken, but it seems the benchmark results posted are only using quite small/default values for "max_sync_workers_per_subscription", so I wondered how those results are affected by increasing that GUC. I think having only very few workers would cause more sequential processing, so conveniently the effect of the patch avoiding re-launch might be seen in the best possible light. OTOH, using more TSWs in the first place might reduce the overall tablesync time because the subscriber can do more work in parallel.

So I'm not quite sure what the goal is here. E.g. if the user doesn't care much about how long the tablesync phase takes, then there is maybe no need for this patch at all. OTOH, I thought if a user does care about the subscription startup time, won't those users be opting for a much larger "max_sync_workers_per_subscription" in the first place? Therefore, shouldn't the benchmarking be using a larger number too?

======

Here are a few other random things noticed while looking at patch 0001:

1. Commit message

1a.
typo /sequantially/sequentially/

1b.
Saying "killed" and "killing" seemed a bit extreme and implies somebody else is killing the process. But I think mostly the tablesync worker just ends with a normal proc exit, so maybe reword this a bit.

~~~

2.
It seemed odd that some -- clearly tablesync-specific -- functions are in worker.c instead of in tablesync.c.

2a. e.g. clean_sync_worker

2b. e.g. sync_worker_exit

~~~

3. process_syncing_tables_for_sync

+ /*
+ * Sync worker is cleaned at this point. It's ready to sync next table,
+ * if needed.
+ */
+ SpinLockAcquire(&MyLogicalRepWorker->relmutex);
+ MyLogicalRepWorker->ready_to_reuse = true;
  SpinLockRelease(&MyLogicalRepWorker->relmutex);
+ }
+
+ SpinLockRelease(&MyLogicalRepWorker->relmutex);

Isn't there a double release of that mutex happening there?

------
Kind Regards,
Peter Smith.
Fujitsu Australia
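
PS. For what it's worth, below is a very rough sketch of the kind of 'type' field I had in mind for comment #2 above. All of the enum/field names here are just made up for illustration (they are not from patch 0001), and most of the existing LogicalRepWorker fields are omitted:

/* Hypothetical worker-type enum -- names are illustrative only. */
typedef enum LogicalRepWorkerType
{
	WORKERTYPE_APPLY,
	WORKERTYPE_TABLESYNC
} LogicalRepWorkerType;

typedef struct LogicalRepWorker
{
	/* What kind of worker this is; set at launch time and never changes. */
	LogicalRepWorkerType type;

	/*
	 * Relation currently being synchronized. An "available" (pooled)
	 * tablesync worker could then just have relid == InvalidOid, because
	 * the 'type' field already records that it is a TSW.
	 */
	Oid			relid;

	/* ... all the other existing fields stay as they are ... */
} LogicalRepWorker;

Then, instead of relying on OidIsValid(worker->relid) to decide whether a worker is a tablesync worker, code could check something like worker->type == WORKERTYPE_TABLESYNC, which keeps working even while relid is reset between tables.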