Re: Missing rows after migrating from postgres 11 to 12 with logical replication

2020-12-20 Thread Lars Vonk
Hi,

Just wondering if someone knows how this could have happened? Did we miss
out on something when setting up the logical replication? Are there any
scenario's in which this could happen (like database restart or anything
else?).
Or should I report this a bug (although I can't image it is)?
We really would like to know how we can prevent this from happening the
next time.

We still have the old primary, and a snapshot of the current primary around
the time we flipped from the old to the new. So we could some digging into
the cause, but we don't know what to look for...

Any help or tips are appreciated.

Thanks in advance,

Lars


On Fri, Dec 18, 2020 at 4:42 PM Lars Vonk  wrote:

> Hi,
>
> We migrated from postgres 11 to 12 using logical replication (over local
> network). Today we noticed that one table is missing 1252 rows after the
> replication finished and we flipped to the new primary (we still have the
> old master database so we can recover).
>
> We see that these rows were inserted in the table after starting the
> initial copy of the table. Most of the missing rows seem from new inserts
> happening **during the initial copy** (1230) and the rest (22) from inserts
> **during the period the replication ran** (7 days).
>
> After further investigation unfortunately more tables have missing rows,
> all of them are after the initial table copy phase. We took a per-table
> approach for the replication, starting with creating an empty publication
> and adding tables via
>
> ALTER PUBLICATION pg12_migration ADD TABLE FOO
>
> After that we refreshed the publication on the "new postgres 12 primary"
> using
>
> ALTER SUBSCRIPTION pg12_migration REFRESH PUBLICATION;
>
> We only added new tables after the the initial copy of the previous was
> done (the internal state was replicating).
>
> We never stopped the subscriptions during all this and we started with a
> fresh schema.
>
> We did some sanity checks before we switched to the new master, like
> comparing max(id) to see if the replica was up to date (including this
> table) and counts on some smaller tables and that all checked out okay, we
> never thought of missing rows somewhere in between
>
> So how can this happen?
>
> Lars
>


Re: Missing rows after migrating from postgres 11 to 12 with logical replication

2020-12-20 Thread Adrian Klaver

On 12/20/20 8:33 AM, Lars Vonk wrote:

Hi,

Just wondering if someone knows how this could have happened? Did we 
miss out on something when setting up the logical replication? Are there 
any scenario's in which this could happen (like database restart or 
anything else?).

Or should I report this a bug (although I can't image it is)?
We really would like to know how we can prevent this from happening the 
next time.


We still have the old primary, and a snapshot of the current primary 
around the time we flipped from the old to the new. So we could some 
digging into the cause, but we don't know what to look for...




Questions I have:

1) Was there activity on the 12 instance while it was being replicated 
to that could account for the missing(deleted?) rows?


2) Are the logs still available for inspection to see if there where any 
errors thrown?


3) Are there FK relationships involved?

4) How did you determine the rows where missing?



Any help or tips are appreciated.

Thanks in advance,

Lars


On Fri, Dec 18, 2020 at 4:42 PM Lars Vonk > wrote:


Hi,

We migrated from postgres 11 to 12 using logical replication (over
local network). Today we noticed that one table is missing 1252 rows
after the replication finished and we flipped to the new primary (we
still have the old master database so we can recover).

We see that these rows were inserted in the table after starting the
initial copy of the table. Most of the missing rows seem from new
inserts happening **during the initial copy** (1230) and the rest
(22) from inserts **during the period the replication ran** (7 days).

After further investigation unfortunately more tables have missing
rows, all of them are after the initial table copy phase. We took a
per-table approach for the replication, starting with creating an
empty publication and adding tables via

ALTER PUBLICATION pg12_migration ADD TABLE FOO

After that we refreshed the publication on the "new postgres 12
primary" using

ALTER SUBSCRIPTION pg12_migration REFRESH PUBLICATION;

We only added new tables after the the initial copy of the previous
was done (the internal state was replicating).

We never stopped the subscriptions during all this and we started
with a fresh schema.

We did some sanity checks before we switched to the new master, like
comparing max(id) to see if the replica was up to date (including
this table) and counts on some smaller tables and that all checked
out okay, we never thought of missing rows somewhere in between

So how can this happen?

Lars




--
Adrian Klaver
adrian.kla...@aklaver.com




some questions regarding replication issues and timeline/history files

2020-12-20 Thread Marcin Giedz
Hello, 

I've created synchronous replication between Primary and Secondary server and 
put pacemaker + PAF over. Client is doing some stress tests (switching nodes, 
disabling particular node, etc) and it's second time when this kind of problem 
occoures : 

2020-12-18 14:03:46.658 CET [unknown] [28787]ERROR: requested starting point 
F/A200 on timeline 39 is not in this server's history 
2020-12-18 14:03:46.658 CET [unknown] [28787]DETAIL: This server's history 
forked from timeline 39 at F/A1023338. 

questions: 
1. what does this mean ? How this can happen ? Does this mean that for some 
"point in time" both servers were primary ? 
2. in xxx.history files I find the following rows: 
43 F/A6A0 no recovery target specified 
44 F/A7A0 no recovery target specified 
45 F/A8A0 no recovery target specified 

again: what does this mean ? 

3. general question: can anyone suggest deep explanation of timelines and 
history files to understand how this postgresql nature works ? 



Many thx 
Marcin 


Prevent pg_rewind destroying the data

2020-12-20 Thread Christopher Pereira

Hi,

When pg_rewind is interrupted due to network errors, the cluster gets 
corrupted:


Running pg_rewind for a second time returns "pg_rewind: fatal: target 
server must be shut down cleanly".


Trying to fix the cluster with "/usr/pgsql-12/bin/postmaster' --single 
-F -D '/var/lib/pgsql/12/mydb' -c archive_mode=on -c 
archive_command=false" throws:


   LOG:  could not read from log segment 003B003E,
   offset 0: read 0 of 8192
   LOG:  invalid primary checkpoint record
   PANIC:  could not locate a valid checkpoint record

When a cluster failsover because of a network problem, chances are high 
that another network problem may occur while we run pg_rewind.
It would be nice if pg_rewind wouldn't destroy the data and leave the 
cluster in a state where retrying pg_rewind can succeed.


As a workaround we are thinking in taking a LVM snapshot or do a "cp 
--reflink" before running pg_rewind and restore if there is a failure, 
but it would be nice if pg_rewind were "non destructive".


Is this possible?
Am I missing something?

We are using PG 12.