Re: Skipping logical replication transactions on subscriber side

Masahiko Sawada Tue, 30 Nov 2021 21:24:25 -0800

On Wed, Dec 1, 2021 at 1:00 PM Amit Kapila <amit.kapil...@gmail.com> wrote:
>
> On Wed, Dec 1, 2021 at 9:12 AM Masahiko Sawada <sawada.m...@gmail.com> wrote:
> >
> > On Wed, Dec 1, 2021 at 12:22 PM Amit Kapila <amit.kapil...@gmail.com> wrote:
> > >
> > > On Wed, Dec 1, 2021 at 8:24 AM houzj.f...@fujitsu.com
> > > <houzj.f...@fujitsu.com> wrote:
> > > >
> > > > I have a question about the testcase (I could be wrong here).
> > > >
> > > > Is it possible that a race condition happens between the apply
> > > > worker (test_tab1) and the table sync worker (test_tab2)? If so, it
> > > > seems the error ("replication origin with OID") could happen randomly
> > > > until we resolve the conflict.
> > > > Based on this, for the following code:
> > > > -----
> > > >     # Wait for the error statistics to be updated.
> > > >     my $check_sql = qq[SELECT count(1) > 0 ] . $part_sql;
> > > >     $node->poll_query_until(
> > > >         'postgres', $check_sql,
> > > >     ) or die "Timed out while waiting for statistics to be updated";
> > > >
> > > > * [1] *
> > > >
> > > >     $check_sql =
> > > >         qq[
> > > > SELECT subname, last_error_command, last_error_relid::regclass,
> > > > last_error_count > 0 ] . $part_sql;
> > > >     my $result = $node->safe_psql('postgres', $check_sql);
> > > >     is($result, $expected, $msg);
> > > > -----
> > > >
> > > > Is it possible that the error ("replication origin with OID") happens
> > > > again at place [1]? In that case, the error message we have checked
> > > > could be replaced by another error ("replication origin ...") and the
> > > > test would then fail?
> > > >
> > >
> > > Once we get the "duplicate key violation ..." error before * [1] * via
> > > the apply worker, we shouldn't get a replication origin-specific error,
> > > because the origin setup is done before starting to apply changes.
> >
> > Right.
> >
> > > Also, even if that or some other error happens after * [1] *, it
> > > should still succeed because of the errmsg_prefix check.
> >
> > In this case, the old error ("duplicate key violation ...") is
> > overwritten by a new error (e.g., a connection error; I'm not sure how
> > likely that is)
> >
>
> Yeah, or probably some memory allocation failure. I think the
> probability of such failures is very low, but OTOH why take the chance.
>
> > and the test fails because the query returns no
> > entries, no?
> >
>
> Right.
>
> > If so, the result from the second check_sql is unstable
> > and it's probably better to check the result only once. That is, the
> > first check_sql includes the command and we exit from the function
> > once we confirm the error entry is updated as expected.
> >
>
> Yeah, I think that should be fine.


Okay, I've attached an updated patch. Please review it.
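
For reference, the single-check idea discussed above could look roughly
like the sketch below. This is only an illustration, not the contents of
the attached patch. The helper names build_check_sql and
wait_for_error_entry are hypothetical, $node is assumed to be a
PostgreSQL::Test::Cluster object, and $part_sql is assumed to end in a
WHERE clause so that further AND conditions can be appended.

```perl
use strict;
use warnings;

# Hypothetical helper: fold the expected command and relation into the
# polled query, so one query both waits for and verifies the error entry.
# Assumes $part_sql is a "FROM ... WHERE ..." fragment as in the quoted test.
sub build_check_sql
{
    my ($part_sql, $command, $relname) = @_;

    return qq[SELECT count(1) > 0 ] . $part_sql
      . qq[ AND last_error_command = '${command}']
      . qq[ AND last_error_relid = '${relname}'::regclass];
}

# Hypothetical helper: poll once until the expected entry shows up.
# There is no second query, so a later unrelated error that overwrites
# the entry after the poll succeeds cannot make the test fail.
sub wait_for_error_entry
{
    my ($node, $part_sql, $command, $relname) = @_;

    my $check_sql = build_check_sql($part_sql, $command, $relname);
    $node->poll_query_until('postgres', $check_sql)
      or die "Timed out while waiting for statistics to be updated";
    return;
}
```

The point of the design is that the poll itself is the assertion: we
return as soon as the entry matches, instead of re-reading it afterwards.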

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

v2-0001-Fix-regression-test-failure-caused-by-commit-8d74.patch
Description: Binary data
