On Tue, Dec 8, 2020 at 11:53 AM Peter Smith <smithpb2...@gmail.com> wrote:
>
> Yes, I observed this same behavior.
>
> IIUC the only way for the tablesync worker to go from CATCHUP mode to
> SYNCDONE is via the call to process_sync_tables.
>
> But a side-effect of this is, when messages arrive during this CATCHUP
> phase one tx will be getting handled by the tablesync worker before
> the process_sync_tables() is ever encountered.
>
> I have created and attached a simple patch which allows the tablesync
> to detect if there is anything to do *before* it enters the apply main
> loop. Calling process_sync_tables() before the apply main loop offers
> a quick way out so the message handling will not be split
> unnecessarily between the workers.
>
Yeah, this demonstrates the idea can work, but as mentioned in my previous email [1] it needs much more work to keep the initial COPY and the later fetching of changes from the publisher consistent.

So, let me summarize the discussion so far. We wanted to enhance the tablesync phase of Logical Replication to enable decoding of prepared transactions [2]. The problem was that when we stream prepared transactions in the tablesync worker, it simply commits them, due to the requirement of maintaining a single transaction for the entire duration of the copy and the streaming of transactions afterward. We can't simply disable the decoding of prepared xacts for tablesync workers because that can skip some of the prepared xacts forever on the subscriber, as explained in one of the emails above [3]. Now, while investigating solutions to enhance tablesync to support decoding at prepare time, I found that due to the current design of tablesync we can see partial data of transactions on the subscriber, which is also explained in the email above with an example [4]. This visibility problem has existed since Logical Replication was introduced in PostgreSQL, and the only answer I have got till now is that there doesn't seem to be any other alternative, which I think is not true; I have provided one alternative as well.

Next, we have discussed three different solutions, all of which solve the first problem (allow the tablesync worker to decode transactions at prepare time) and one of which solves both the first and the second problem (partial transaction data visibility).

Solution-1: Allow the tablesync worker to use multiple transactions. The reason for doing everything in a single transaction is that if we commit after the initial COPY and then crash while streaming changes of other transactions, the state of the table won't be known after the restart: as we are using a temporary slot, we don't know from where to restart syncing the table. IIUC, we primarily need to do two things to achieve multiple transactions. One is to have an additional state in the catalog (say "catchup") which says that the initial copy is done. The other is to use a permanent slot, with which we can track the progress so that after a restart (due to a crash, connection break, etc.) we can start from the appropriate position. This will allow us to do less work after recovering from a crash because we will know the restart point. As Craig mentioned, it also allows the sync slot to advance, freeing any held upstream resources before the whole sync is done, which is good if the upstream is busy and generating lots of WAL. Finally, committing as we go means we won't exceed the CID increment limit within a single txn.

Solution-2: Make the tablesync worker just gather messages after the COPY, similar to how the current streaming of in-progress transactions feature gathers messages into a "changes" file, so that they can be replayed later by the apply worker. As we then don't need to replay the individual transactions in the tablesync worker in a single transaction, the publisher can send transactions decoded at prepare time to the subscriber. This has some disadvantages: each transaction processed by the tablesync worker needs to be durably written to a file, and it can also lead to some apply lag later when the apply worker processes the same changes.

Solution-3: Allow the tablesync workers to just perform the initial COPY and then, once the COPY is done for all relations, let the apply worker stream all the future changes.

Now, surely if large copies are required for multiple relations then the replay of transactions by the apply worker would be delayed a bit, but I don't know how much that matters compared to the transaction visibility issue, and anyway we would have achieved the maximum parallelism by performing the copies via multiple workers. This would also reduce the load (both CPU and I/O) on the publisher side by decoding the data only once, instead of once for each tablesync worker and separately for the apply worker, so I think it will use fewer resources to finish the work.

Currently, the tablesync worker creates a slot with the CRS_USE_SNAPSHOT option, which creates a transaction snapshot on the publisher, and then we use the same snapshot for the COPY from the publisher. After this, when we try to receive the data from the publisher using the same slot, it will be in sync with the COPY. To keep the same consistency between the COPY and the data we receive from the publisher in this approach, I think we need to export the snapshot while creating the slot in the apply worker by using CRS_EXPORT_SNAPSHOT, and then have all the tablesync workers doing the copy use that same snapshot. In the tablesync workers, we can use the SET TRANSACTION SNAPSHOT command after "BEGIN READ ONLY ISOLATION LEVEL REPEATABLE READ" to use the exported snapshot. That way the COPY will use the same snapshot as is used for receiving the changes in the apply worker, and the data will be in sync.
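To make this concrete, the flow could look roughly like the below (a sketch only; the slot name, table name, and snapshot id here are made up for illustration):

-- In the apply worker, on its replication connection, create the slot
-- with an exported snapshot instead of today's USE_SNAPSHOT:
CREATE_REPLICATION_SLOT "mysub_slot" LOGICAL pgoutput EXPORT_SNAPSHOT;
-- the result includes a snapshot_name, e.g. '00000003-0000001B-1'

-- In each tablesync worker, on its own connection to the publisher,
-- run the initial copy under that exported snapshot:
BEGIN READ ONLY ISOLATION LEVEL REPEATABLE READ;
SET TRANSACTION SNAPSHOT '00000003-0000001B-1';
COPY public.mytab TO STDOUT;
COMMIT;

One thing to note is that, IIRC, the exported snapshot remains usable only while the exporting transaction is alive on the publisher (for a walsender, until the next command is executed on that connection), so the tablesync workers would have to import it before the apply worker starts streaming from the slot.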
Then we also need a way to export a snapshot while the apply worker is already receiving the changes, because users can run 'ALTER SUBSCRIPTION name REFRESH PUBLICATION', which allows new tables to be synced. I think we need to introduce a new command in exec_replication_command() to export the snapshot from the existing slot, which the new tablesync worker can then use.

Among the above three solutions, the first two solve only the first problem (allow the tablesync worker to decode transactions at prepare time) whereas the third solves both the first and the second problem (partial transaction data visibility). The third solution requires quite some redesign of how the Logical Replication work is synchronized between the apply and tablesync workers and might turn out to be a bigger implementation effort. I am tentatively thinking to go with the first or second solution at this stage; if people later feel that we need some bigger redesign, then we can go with something along the lines of Solution-3.

Thoughts?

[1] - https://www.postgresql.org/message-id/CAA4eK1%2BQC74wRQmbYT%2BMmOs%3DYbdUjuq0_A9CBbVoQMB1Ryi-OA%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/cahut+puemk4so8ogzxc_ftzpkga8uc-y5qi-krqhsy_p0i3...@mail.gmail.com
[3] - https://www.postgresql.org/message-id/CAA4eK1KFsjf6x-S7b0dJLvEL3tcn9x-voBJiFoGsccyH5xgDzQ%40mail.gmail.com
[4] - https://www.postgresql.org/message-id/CAA4eK1Ld9XaLoTZCoKF_gET7kc1fDf8CPR3CM48MQb1N1jDLYg%40mail.gmail.com

--
With Regards,
Amit Kapila.