On Mon, May 31, 2021 at 12:39 PM Masahiko Sawada <sawada.m...@gmail.com> wrote: > > On Sat, May 29, 2021 at 3:54 PM Amit Kapila <amit.kapil...@gmail.com> wrote: > > > > > > > 1. the worker records the XID and commit LSN of the failed transaction > > > > > to a catalog. > > > > > > > > > > > > > When will you record this info? I am not sure if we can try to update > > > > this when an error has occurred. We can think of using try..catch in > > > > apply worker and then record it in catch on error but would that be > > > > advisable? One random thought that occurred to me is to that apply > > > > worker notifies such information to the launcher (or maybe another > > > > process) which will log this information. > > > > > > Yeah, I was concerned about that too and had the same idea. The > > > information still could not be written if the server crashes before > > > the launcher writes it. But I think it's an acceptable. > > > > > > > True, because even if the launcher restarts, the apply worker will > > error out again and resend the information. I guess we can have an > > error queue where apply workers can add their information and the > > launcher will then process those. If we do that, then we need to > > probably define what we want to do if the queue gets full, either > > apply worker nudge launcher and wait or it can just throw an error and > > continue. If you have any better ideas to share this information then > > we can consider those as well. > > +1 for using error queue. Maybe we need to avoid queuing the same > error more than once to avoid the catalog from being updated > frequently? >
Yes, I think it is important because after logging the subscription may still error again unless the user does something to skip or resolve the conflict. I guess you need to check for the existence of error in systable and or in the queue. > > > > > > > > I guess we can update catalog tuples in place when another conflict > > > happens next time. The catalog tuple should be fixed size. The > > > already-resolved conflict will have the commit LSN older than its > > > replication origin's LSN. > > > > > > > Okay, but I have a slight concern that we will keep xid in the system > > which might have been no longer valid. So, we will keep this info > > about subscribers around till one performs drop subscription, > > hopefully, that doesn't lead to too many rows. This will be okay as > > per the current design but say tomorrow we decide to parallelize the > > apply for a subscription then there could be multiple errors > > corresponding to a subscription and in that case, such a design might > > appear quite limiting. One possibility could be that when the launcher > > is periodically checking for new error messages, it can clean up the > > conflicts catalog as well, or maybe autovacuum does this periodically > > as it does for stats (via pgstat_vacuum_stat). > > Yeah, it's better to have a way to cleanup no longer valid entries in > the catalog in the case where the worker failed to remove it. I prefer > the former idea so far, > Which idea do you refer to here as former (cleaning up by launcher)? > so I'll implement it in a PoC patch. > Okay. -- With Regards, Amit Kapila.