On Thu, Aug 10, 2023 at 10:15 AM Amit Kapila <amit.kapil...@gmail.com> wrote:
>
> On Wed, Aug 9, 2023 at 8:28 AM Zhijie Hou (Fujitsu)
> <houzj.f...@fujitsu.com> wrote:
> >
> > On Thursday, August 3, 2023 7:30 PM Melih Mutlu <m.melihmu...@gmail.com>  
> > wrote:
> >
> > > Right. I attached the v26 as you asked.
> >
> > Thanks for posting the patches.
> >
> > While reviewing the patch, I noticed one rare case where it's possible
> > to have two table sync workers for the same table at the same time.
> >
> > The patch relies on LogicalRepWorkerLock to prevent concurrent access,
> > but the apply worker starts the new worker only after releasing the
> > lock. So, at the point[1] where the lock has been released but the new
> > table sync worker has not yet been started, it seems possible that
> > another old table sync worker will be reused for the same table.
> >
> >                                 /* Now safe to release the LWLock */
> >                                 LWLockRelease(LogicalRepWorkerLock);
> > *[1]
> >                                 /*
> >                                  * If there are free sync worker slot(s),
> >                                  * start a new sync worker for the table.
> >                                  */
> >                                 if (nsyncworkers < max_sync_workers_per_subscription)
> >                                 ...
> >                                         logicalrep_worker_launch(MyLogicalRepWorker->dbid,
> >
>
> Yeah, this is a problem. I think one idea to solve this is by
> extending the lock duration till we launch the tablesync worker, but we
> should also consider changing this locking scheme so that there is a
> better way to indicate that a tablesync is in progress for a particular
> rel. Currently, the code in TablesyncWorkerMain() also acquires the
> lock in exclusive mode even while the tablesync for a rel is in
> progress, which I guess could easily hurt us for larger values of
> max_logical_replication_workers. So, that could be another motivation
> to think of a different locking scheme.
>

Yet another problem is that the apply worker currently maintains a hash
table of 'last_start_times' to avoid restarting a tablesync worker
immediately after an error. The same throttling is missing when a table
sync worker is reused. One possibility is to use a shared hash table to
remember the start times, but I think the details may depend on how we
decide to solve the previous problem reported by Hou-San.

-- 
With Regards,
Amit Kapila.

