On Mon, Dec 7, 2020 at 10:02 AM Craig Ringer
<craig.rin...@enterprisedb.com> wrote:
>
> On Mon, 7 Dec 2020 at 11:44, Peter Smith <smithpb2...@gmail.com> wrote:
>>
>> Basically, I was wondering why can't the "tablesync" worker just
>> gather messages in a similar way to how the current streaming feature
>> gathers messages into a "changes" file, so that they can be replayed
>> later.
>>
>
> See the related thread "Logical archiving"
>
> https://www.postgresql.org/message-id/20d9328b-a189-43d1-80e2-eb25b9284...@yandex-team.ru
>
> where I addressed some parts of this topic in detail earlier today.
>
>> A) The "tablesync" worker (after the COPY) does not ever apply any of
>> the incoming messages, but instead it just gobbles them into a
>> "changes" file until it decides it has reached SYNCDONE state and
>> exits.
>
> This has a few issues.
>
> Most importantly, the sync worker must cooperate with the main apply worker
> to achieve a consistent end-of-sync cutover.
>
In this idea, there is no need to change the end-of-sync cutover. It
will work as it is now. I am not sure what makes you think so.

> The sync worker must have replayed the pending changes in order to make this
> cut-over, because the non-sync apply worker will need to start applying
> changes on top of the resync'd table potentially as soon as the next
> transaction it starts applying, so it needs to see the rows there.
>

The change here is that the apply worker will check for the changes
file and, if it exists, apply the changes from it before it sets the
relstate to SUBREL_STATE_READY in process_syncing_tables_for_apply().
So it will not miss seeing any rows.

> Doing this would also add another round of write multiplication since the
> data would get spooled then applied to WAL then heap. Write multiplication is
> already an issue for logical replication so adding to it isn't particularly
> desirable without a really compelling reason.
>

It will solve our problem of allowing decoding of prepared xacts in
pgoutput. I have explained the problem above [1].

The other idea we discussed is to allow an additional state in
pg_subscription_rel, make the slot permanent in the tablesync worker,
and then process transaction-by-transaction in the apply worker. Does
that approach sound better? Is there any bigger change involved in
that approach (making the tablesync slot permanent) that I am missing?

> With the write multiplication comes disk space management issues for big
> transactions as well as the obvious performance/throughput impact.
>
> It adds even more latency between upstream commit and downstream apply,
> something that is again already an issue for logical replication.
>
> Right now we don't have any concept of a durable and locally flushed spool.
>

I think we have a concept quite close to it for writing the changes of
in-progress xacts, as done in PG-14. It is not durable, but that
shouldn't be a big problem if we allow syncing the changes file.
> It's not impossible to do as you suggest but the cutover requirement makes it
> far from simple. As discussed in the logical archiving thread I think it'd be
> good to have something like this, and there are times the write
> multiplication price would be well worth paying. But it's not easy.
>
>> B) Then, when the "apply" worker proceeds, if it detects the existence
>> of the "changes" file it will replay/apply_dispatch all those gobbled
>> messages before just continuing as normal.
>
> That's going to introduce a really big stall in the apply worker's progress
> in many cases. During that time it won't be receiving from upstream (since we
> don't spool logical changes to disk at this time) so the upstream lag will
> grow. That will impact synchronous replication, pg_wal size management,
> catalog bloat, etc. It'll also leave the upstream logical decoding session
> idle, so when it resumes it may create a spike of I/O and CPU load as it
> catches up, as well as a spike of network traffic. And depending on how close
> the upstream write rate is to the max decode speed, network throughput max,
> and downstream apply speed max, it may take some time to catch up over the
> resulting lag.
>

This is just for the initial tablesync phase. I think it is equivalent
to saying that during basebackup, we need to start physical
replication in parallel. I agree that sometimes it can take a lot of
time to copy large tables, but it will be just one time and no worse
than other situations like basebackup.

[1] - https://www.postgresql.org/message-id/CAA4eK1KFsjf6x-S7b0dJLvEL3tcn9x-voBJiFoGsccyH5xgDzQ%40mail.gmail.com

--
With Regards,
Amit Kapila.