Hi Adrian,

Thanks for taking the time to reply!

First to answer your questions:

1) Was there activity on the 12 instance while it was being replicated
> to that could account for the missing(deleted?) rows?
>

No there was no activity other than us doing some queries to check how far
the replication was.

2) Are the logs still available for inspection to see if there where any
> errors thrown?
>

Yes, and we dug into those. And we also found some indications that
something went wrong.

3) Are there FK relationships involved?
>

No

4) How did you determine the rows where missing?


We were alerted by a bug later that day and found that some rows were
missing in the new primary. We did a compare based on primary key and found
that several tables were missing rows. Before the switch we unfortunately
only checked max(id) and did some counts on tables and those all checked
out. We didn't do a count on all tables...

So to come back at the logs:

We dug a little deeper and we did found ERROR logs around the time we ran
the initial copies. During a period of several hours that day we see a
couple of messages like:

ERROR: requested WAL segment 00000001000001F10000001D has already been
> removed
>

This message is logged a few times and then no more (perhaps it recovered
from it?)

Other than this error there are no other errors, but unfortunately we never
checked this before migrating to the new primary...
In hindsight not very smart of course, but we never thought of this because:

a) the initial copy and the catching up all seemed fine;
b) in previous attempts when we made some errors we noticed for instance
that the WAL files on the previous primary were kept because the new
primary did not yet process them.
So we assumed when all WAL files are "gone" and the max(id) checks out the
replica is in sync and consistent with the primary;
c) our experience with hotstandby replication is that whenever a WAL
segment is missing it won't skip over it and wait until you restore it. We
assumed (and still assume) that this was also the case with logical
replication;

So the questions we now have is:

1) is it correct that a replica of a logical replication skips over missing
WAL files.
2) if so how can you know that it skipped a wal without looking at the log
files or doing a full count?
3) Is there a fail fast mechanism for logical replication (like hotstandy)
that when a WAL file is missing that it stops with further replication

Regards,
Lars

On Sun, Dec 20, 2020 at 6:58 PM Adrian Klaver <adrian.kla...@aklaver.com>
wrote:

> On 12/20/20 8:33 AM, Lars Vonk wrote:
> > Hi,
> >
> > Just wondering if someone knows how this could have happened? Did we
> > miss out on something when setting up the logical replication? Are there
> > any scenario's in which this could happen (like database restart or
> > anything else?).
> > Or should I report this a bug (although I can't image it is)?
> > We really would like to know how we can prevent this from happening the
> > next time.
> >
> > We still have the old primary, and a snapshot of the current primary
> > around the time we flipped from the old to the new. So we could some
> > digging into the cause, but we don't know what to look for...
> >
>
> Questions I have:
>
> 1) Was there activity on the 12 instance while it was being replicated
> to that could account for the missing(deleted?) rows?
>
> 2) Are the logs still available for inspection to see if there where any
> errors thrown?
>
> 3) Are there FK relationships involved?
>
> 4) How did you determine the rows where missing?
>
>
> > Any help or tips are appreciated.
> >
> > Thanks in advance,
> >
> > Lars
> >
> >
> > On Fri, Dec 18, 2020 at 4:42 PM Lars Vonk <lars.v...@gmail.com
> > <mailto:lars.v...@gmail.com>> wrote:
> >
> >     Hi,
> >
> >     We migrated from postgres 11 to 12 using logical replication (over
> >     local network). Today we noticed that one table is missing 1252 rows
> >     after the replication finished and we flipped to the new primary (we
> >     still have the old master database so we can recover).
> >
> >     We see that these rows were inserted in the table after starting the
> >     initial copy of the table. Most of the missing rows seem from new
> >     inserts happening **during the initial copy** (1230) and the rest
> >     (22) from inserts **during the period the replication ran** (7 days).
> >
> >     After further investigation unfortunately more tables have missing
> >     rows, all of them are after the initial table copy phase. We took a
> >     per-table approach for the replication, starting with creating an
> >     empty publication and adding tables via
> >
> >     ALTER PUBLICATION pg12_migration ADD TABLE FOO
> >
> >     After that we refreshed the publication on the "new postgres 12
> >     primary" using
> >
> >     ALTER SUBSCRIPTION pg12_migration REFRESH PUBLICATION;
> >
> >     We only added new tables after the the initial copy of the previous
> >     was done (the internal state was replicating).
> >
> >     We never stopped the subscriptions during all this and we started
> >     with a fresh schema.
> >
> >     We did some sanity checks before we switched to the new master, like
> >     comparing max(id) to see if the replica was up to date (including
> >     this table) and counts on some smaller tables and that all checked
> >     out okay, we never thought of missing rows somewhere in between....
> >
> >     So how can this happen?
> >
> >     Lars
> >
>
>
> --
> Adrian Klaver
> adrian.kla...@aklaver.com
>

Reply via email to