The tablesync worker in logical replication performs the table data sync in a single transaction: it copies the initial data and then catches up with the apply worker in that same transaction. A comment in LogicalRepSyncTableStart says so ("We want to do the table data sync in a single transaction."), but I can't find the concrete reasoning behind it. Is there any fundamental problem if we commit the transaction after the initial copy and slot creation in LogicalRepSyncTableStart and then apply subsequent transactions the same way the apply worker does? I have tried this in the attached patch (a quick prototype to test) and didn't find any problems with the regression tests. I also ran a few manual tests and didn't find any problem there either. Now, it is quite possible that the current approach is mandatory, or that something else is required before this requirement can be removed, but either way I think we can do better with the comments in this area.
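To make the two transaction shapes concrete, here is a toy C model. It is not PostgreSQL code and is not taken from the attached patch: ToyWorker, toy_begin, toy_commit and the two toy_sync_* functions are names made up for illustration, and the "copy" and "apply" steps are reduced to counters. It only shows where the local transaction boundaries fall in each approach.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a tablesync worker; not a PostgreSQL struct. */
typedef struct ToyWorker
{
    bool in_xact;        /* is a local transaction currently open? */
    int  local_commits;  /* local commits performed so far */
    int  remote_applied; /* remote transactions applied so far */
} ToyWorker;

static void toy_begin(ToyWorker *w)
{
    assert(!w->in_xact);  /* no nested transactions in this model */
    w->in_xact = true;
}

static void toy_commit(ToyWorker *w)
{
    assert(w->in_xact);
    w->in_xact = false;
    w->local_commits++;
}

/* Current shape: initial copy and catch-up share one local transaction. */
static int toy_sync_single_xact(int n_remote)
{
    ToyWorker w = {false, 0, 0};

    toy_begin(&w);
    /* the initial copy would happen here */
    for (int i = 0; i < n_remote; i++)
        w.remote_applied++;  /* a remote commit does not commit locally */
    toy_commit(&w);
    return w.local_commits;
}

/* Proposed shape: commit after the copy, then one local xact per remote xact. */
static int toy_sync_multi_xact(int n_remote)
{
    ToyWorker w = {false, 0, 0};

    toy_begin(&w);
    /* the initial copy and slot creation would happen here */
    toy_commit(&w);
    for (int i = 0; i < n_remote; i++)
    {
        toy_begin(&w);
        w.remote_applied++;
        toy_commit(&w);
    }
    return w.local_commits;
}
```

In this model, syncing a table while three remote transactions arrive ends with one local commit in the first shape and four in the second. Only the second shape has a local transaction boundary around each remote transaction, which is what the apply worker already has.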
The reason I am looking into this area is to support the logical decoding of prepared transactions. See the problem [1] reported by Peter Smith. Basically, when we stream prepared transactions in the tablesync worker, it simply commits them, because it has to maintain a single transaction for the entire duration of the copy and the streaming of transactions. Now, we could fix that problem by disabling the decoding of prepared xacts in the tablesync worker. But that gives rise to a different kind of problem: the prepare will not be sent by the publisher, but a later commit might move the LSN far enough forward for the tablesync worker to catch up with the apply worker. The prepared transaction would then be skipped by both the tablesync and the apply worker.

I think that, apart from unblocking the development of 'logical decoding of prepared xacts', this change will make the code consistent between the apply and tablesync workers and reduce the chances of future bugs in this area. Basically, it will reduce the number of am_tablesync_worker() checks at various places in the code.

I see that this code was added as part of commit 7c4f52409a8c7d85ed169bbbc1f6092274d03920 (Logical replication support for initial data copy).

Thoughts?

[1] - https://www.postgresql.org/message-id/cahut+puemk4so8ogzxc_ftzpkga8uc-y5qi-krqhsy_p0i3...@mail.gmail.com

--
With Regards,
Amit Kapila.
v1-0001-Allow-more-than-one-transaction-in-tablesync-work.patch