Hi Shveta and Shi, Thanks for your investigations.
shveta malik <shveta.ma...@gmail.com>, 8 Şub 2023 Çar, 16:49 tarihinde şunu yazdı: > On Tue, Feb 7, 2023 at 8:18 AM shiy.f...@fujitsu.com > <shiy.f...@fujitsu.com> wrote: > > I reproduced the problem I reported before with latest patch (v7-0001, > > v10-0002), and looked into this problem. It is caused by a similar > reason. Here > > is some analysis for the problem I reported [1].#6. > > > > First, a tablesync worker (worker-1) started for "tbl1", its originname > is > > "pg_16398_1". And it exited because of unique constraint. In > > LogicalRepSyncTableStart(), originname in pg_subscription_rel is updated > when > > updating table state to DATASYNC, and the origin is created when > updating table > > state to FINISHEDCOPY. So when it exited with state DATASYNC , the > origin is not > > created but the originname has been updated in pg_subscription_rel. > > > > Then a tablesync worker (worker-2) started for "tbl2", its originname is > > "pg_16398_2". After tablesync of "tbl2" finished, this worker moved to > sync > > table "tbl1". In LogicalRepSyncTableStart(), it got the originname of > "tbl1" - > > "pg_16398_1", by calling ReplicationOriginNameForLogicalRep(), and tried > to drop > > the origin (although it is not actually created before). After that, it > called > > replorigin_by_name to get the originid whose name is "pg_16398_1" and > the result > > is InvalidOid. Origin won't be created in this case because the sync > worker has > > created a replication slot (when it synced tbl2), so the originid was > still > > invalid and it caused an assertion failure when calling > replorigin_advance(). > > > > It seems we don't need to drop previous origin in worker-2 because the > previous > > origin was not created in worker-1. I think one way to fix it is to not > update > > originname of pg_subscription_rel when setting state to DATASYNC, and > only do > > that when setting state to FINISHEDCOPY. If so, the originname in > > pg_subscription_rel will be set at the same time the origin is created. > > +1. Update of system-catalog needs to be done carefully and only when > origin is created. > I see that setting originname in the catalog before actually creating it causes issues. My concern with setting originname when setting the state to FINISHEDCOPY is that if worker waits until FINISHEDCOPY to update slot/origin name but it fails somewhere before reaching FINISHEDCOPY and after creating slot/origin, then this new created slot/origin will be left behind. It wouldn't be possible to find and drop them since their names are not stored in the catalog. Eventually, this might also cause hitting the max_replication_slots limit in case of such failures between origin creation and FINISHEDCOPY. One fix I can think is to update the catalog right after creating a new origin. But this would also require commiting the current transaction to actually persist the originname. I guess this action of commiting the transaction in the middle of initial sync could hurt the copy process. What do you think? Also; working on an updated patch to address your other comments. Thanks again. Best, -- Melih Mutlu Microsoft