TOAST corruption in standby database

2019-10-24 Thread Alex Adriaanse
We have primary and hot standby databases running Postgres 11.3 inside Docker, 
with their data directories bind-mounted to a reflink-enabled XFS filesystem. 
The VM is running Debian's 4.19.16-1~bpo9+1 kernel inside an AWS EC2 instance.

A few months ago I saw TOAST corruption in a ~44GB table in one of the standby
databases, so I wiped that standby and rebuilt it using pg_basebackup, which
eliminated the corruption. This week corruption has popped up again in the same
table in one of the standby databases. The other standby database has
experienced no corruption.

The corrupted table has four columns of types integer, text, text, and jsonb. 
The corruption happens inside the jsonb column.
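
The table's shape is roughly as follows (a purely illustrative sketch: only the
"identity" and "json" column names appear in the queries below; the table name
and the remaining names are placeholders):

CREATE TABLE example_table (
    id       integer,
    identity text,
    label    text,
    json     jsonb
);

jsonb values larger than roughly 2kB are compressed and, if still large, stored
out of line in the table's TOAST relation, which is where the corruption shows
up.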

The corruption manifests itself as follows in the standby database:

SELECT length(json::text) FROM  WHERE identity = '...';
ERROR:  missing chunk number 0 for toast value 64265646 in pg_toast_16103925

SELECT ctid, chunk_id, chunk_seq, md5(chunk_data) FROM 
pg_toast.pg_toast_16103925 WHERE chunk_id = 64265646;
 ctid | chunk_id | chunk_seq | md5
------+----------+-----------+-----
(0 rows)

SELECT count(1) FROM pg_toast.pg_toast_16103925 WHERE chunk_id = 64265646;
 count
-------
     2
(1 row)

Looking up the TOAST block that is supposed to contain this value, you can see
that the TOAST tuples are missing:

SELECT ctid, chunk_id, chunk_seq, md5(chunk_data) FROM 
pg_toast.pg_toast_16103925 WHERE ctid IN ('(1793121,1)', '(1793121,2)', 
'(1793121,3)', '(1793121,4)', '(1793121,5)', '(1793121,6)', '(1793121,7)');
    ctid     | chunk_id | chunk_seq |               md5
-------------+----------+-----------+----------------------------------
 (1793121,3) | 41259162 |         0 | 1bff36f306bac135cce9da44dd6d6bbb
 (1793121,4) | 41259162 |         1 | b754d688c5c847c7bc519e65741ffef1
 (1793121,5) | 41259163 |         0 | 10dfa4f5b3e32188f0b4b28c9be76abe
 (1793121,6) | 41259163 |         1 | 7dceb98b2c2f4ac3c72245c58c85323f
(4 rows)

For comparison, here are the same queries run against the primary database:

SELECT length(json::text) FROM  WHERE identity = '...';
 length
--------
   7817
(1 row)

SELECT ctid, chunk_id, chunk_seq, md5(chunk_data) FROM 
pg_toast.pg_toast_16103925 WHERE chunk_id = 64265646;
    ctid     | chunk_id | chunk_seq |               md5
-------------+----------+-----------+----------------------------------
 (1793121,1) | 64265646 |         0 | a9a2642e8408fc178fb809b86c430997
 (1793121,2) | 64265646 |         1 | 371bc2628ac5bfc8b37d32f93d08fefe
(2 rows)

SELECT count(1) FROM pg_toast.pg_toast_16103925 WHERE chunk_id = 64265646;
 count
-------
     2
(1 row)

SELECT ctid, chunk_id, chunk_seq, md5(chunk_data) FROM 
pg_toast.pg_toast_16103925 WHERE ctid IN ('(1793121,1)', '(1793121,2)', 
'(1793121,3)', '(1793121,4)', '(1793121,5)', '(1793121,6)', '(1793121,7)');
    ctid     | chunk_id | chunk_seq |               md5
-------------+----------+-----------+----------------------------------
 (1793121,1) | 64265646 |         0 | a9a2642e8408fc178fb809b86c430997
 (1793121,2) | 64265646 |         1 | 371bc2628ac5bfc8b37d32f93d08fefe
 (1793121,3) | 41259162 |         0 | 1bff36f306bac135cce9da44dd6d6bbb
 (1793121,4) | 41259162 |         1 | b754d688c5c847c7bc519e65741ffef1
 (1793121,5) | 41259163 |         0 | 10dfa4f5b3e32188f0b4b28c9be76abe
 (1793121,6) | 41259163 |         1 | 7dceb98b2c2f4ac3c72245c58c85323f
(6 rows)

Looking at the data file for the TOAST relation, the header data structures in 
the relevant block seem fine to me, which makes me think this is not caused by 
filesystem corruption (unless a write silently failed). The second half of that 
block is identical between the primary and corrupted standby, but in the first 
half the corrupted standby is missing data.
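
The segment file (".13") and block offset (89185) in the dd commands below
follow directly from the block number in the ctids above, since each 1GB
segment holds 131072 8kB blocks:

SELECT 1793121 / 131072 AS segment, 1793121 % 131072 AS block_in_segment;
 segment | block_in_segment
---------+------------------
      13 |            89185
(1 row)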

Standby (corrupted):

# dd if=data/base/18034/16103928.13 bs=8192 skip=89185 count=1 status=none | 
hexdump -C | head -8
00000000  a3 0e 00 00 48 46 88 0e  00 00 05 00 30 00 58 0f  |....HF......0.X.|
00000010  00 20 04 20 00 00 00 00  00 00 00 00 00 00 00 00  |. . ............|
00000020  10 98 e0 0f 98 97 e8 00  a8 8f e0 0f 58 8f 96 00  |............X...|
00000030  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000f50  00 00 00 00 00 00 00 00  32 b0 0a 01 00 00 00 00  |........2.......|
00000f60  00 00 00 00 1b 00 61 5c  06 00 03 00 02 09 18 00  |......a\........|
00000f70  9b 90 75 02 01 00 00 00  ac 00 00 00 83 9f 64 00  |..u...........d.|

Primary:

# dd if=data/base/18034/16103928.13 bs=8192 skip=89185 count=1 status=none | 
hexdump -C | head -8
00000000  bd 0e 00 00 08 ad 32 b7  00 00 05 00 30 00 90 04  |......2.....0...|
00000010  00 20 04 20 00 00 00 00  68 87 e0 0f 90 84 a8 05  |. . ....h.......|
00000020  10 98 e0 0f 98 97 e8 00  a8 8f e0 0f 58 8f 96 00  |............X...|
00000030  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000490  a6 07 7e 02 00 00 00 00  00 00 00 00 1b 00 61 5c  |..~...........a\|
000004a0  02 00 03 00 02 09 18 00  ae 9d d4 03 01 00 00 00  |................|

Corruption with duplicate primary key

2019-12-05 Thread Alex Adriaanse
We have a Postgres 10 database that we recently upgraded to Postgres 12 using
pg_upgrade. We have since discovered rows in one of the tables that have
duplicate primary keys:

record_loader=# \d loader.sync
  Table "loader.sync"
  Column   |   Type   | Collation | Nullable | Default
---+--+---+--+-
 source| text |   | not null |
 natural_key   | text |   | not null |
 payload   | jsonb|   |  |
 dispatched| timestamp with time zone |   | not null | now()
 initial_load_id   | text |   |  |
 deleted_load_id   | text |   |  |
 created_timestamp | timestamp with time zone |   |  | now()
 updated_timestamp | timestamp with time zone |   |  | now()
 deleted_timestamp | timestamp with time zone |   |  |
Indexes:
"sync_pkey" PRIMARY KEY, btree (source, natural_key)
Publications:
"debezium"

This table is modified via triggers that fire off when a COPY command inserts 
many rows into another table.

Here are two example duplicate rows:

# SELECT xmin, xmax, cmin, cmax, source, md5(natural_key) AS natural_key_hash, 
dispatched, created_timestamp, updated_timestamp, deleted_timestamp FROM 
loader.sync WHERE (source, natural_key) = ('ok_lease', '...') ORDER BY 
xmin::text::int, cmin::text::int;
-[ RECORD 1 ]-----+---------------------------------
xmin              | 116649
xmax              | 0
cmin              | 5304404
cmax              | 5304404
source            | ok_lease
natural_key_hash  | de3e9a567b90025c3399c4c63c823fe9
dispatched        | 2019-11-24 05:09:36.099686+00
created_timestamp | 2019-11-24 05:09:36.099686+00
updated_timestamp | 2019-11-24 05:09:36.099686+00
deleted_timestamp |
-[ RECORD 2 ]-----+---------------------------------
xmin              | 116649
xmax              | 118583
cmin              | 5312208
cmax              | 5312208
source            | ok_lease
natural_key_hash  | de3e9a567b90025c3399c4c63c823fe9
dispatched        | 2019-11-10 05:09:24.214964+00
created_timestamp | 2019-05-17 21:24:19.558219+00
updated_timestamp | 2019-11-24 05:09:36.099686+00
deleted_timestamp | 2019-11-24 05:09:36.099686+00

It appears that the second row was in place originally, was then updated by a
trigger (and later even deleted, although the delete transaction does not
appear to have committed), and that the first row was inserted within the same
transaction that updated the second row.
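
One way to check whether that deleting transaction (xmax 118583) ever committed
is txid_status(), which reports the fate of a recent transaction. This is just
a sketch: it returns NULL once the transaction is older than what the server
still retains, and passing the bare xid assumes no transaction ID epoch has
passed since then.

SELECT txid_status(118583);
-- returns 'committed', 'aborted', 'in progress', or NULL if too old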

Another example:
-[ RECORD 1 ]-----+---------------------------------
xmin              | 116649
xmax              | 0
cmin              | 5304403
cmax              | 5304403
source            | ok_lease
natural_key_hash  | 1c8031348701a32cb5fee26839d6b0b4
dispatched        | 2019-11-10 05:09:24.214964+00
created_timestamp | 2019-05-31 06:00:33.765547+00
updated_timestamp | 2019-11-24 05:09:36.099686+00
deleted_timestamp | 2019-11-24 05:09:36.099686+00
-[ RECORD 2 ]-----+---------------------------------
xmin              | 116649
xmax              | 0
cmin              | 5304404
cmax              | 5304404
source            | ok_lease
natural_key_hash  | 1c8031348701a32cb5fee26839d6b0b4
dispatched        | 2019-11-24 05:09:36.099686+00
created_timestamp | 2019-11-24 05:09:36.099686+00
updated_timestamp | 2019-11-24 05:09:36.099686+00
deleted_timestamp |

Both examples have in common that the two duplicate rows were touched within 
the same transaction.
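
For reference, duplicate keys like these can be located with a grouping query
along the following lines (just a sketch; md5() keeps the long natural_key
values out of the output):

SELECT source, md5(natural_key) AS natural_key_hash, count(*) AS copies
FROM loader.sync
GROUP BY source, md5(natural_key)
HAVING count(*) > 1;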

This database runs inside Docker, with the data directory bind-mounted to a
reflink-enabled XFS filesystem. The VM is running Debian's 4.19.16-1~bpo9+1
kernel inside an AWS EC2 instance. We have Debezium streaming data from this
database via pgoutput.

Recreating the primary key confirms that the constraint doesn't (or at least 
shouldn't) permit these duplicate rows:

record_loader=# BEGIN;
BEGIN
record_loader=# ALTER TABLE loader.sync DROP CONSTRAINT sync_pkey;
ALTER TABLE
record_loader=# ALTER TABLE loader.sync ADD CONSTRAINT sync_pkey PRIMARY KEY 
(source, natural_key);
ERROR:  could not create unique index "sync_pkey"
DETAIL:  Key (source, natural_key)=(ok_lease, ...) is duplicated.
CONTEXT:  parallel worker

Any ideas on what might cause this behavior?

Thanks,

Alex



Re: Corruption with duplicate primary key

2019-12-11 Thread Alex Adriaanse
On Thu, December 5, 2019 at 5:34 PM Peter Geoghegan wrote:
> > We have a Postgres 10 database that we recently upgraded to Postgres 12 
> > using pg_upgrade. We recently discovered that there are rows in one of the 
> > tables that have duplicate primary keys:
> 
> What's the timeline here? In other words, does it look like these rows
> were updated and/or deleted before, around the same time as, or after
> the upgrade?

The Postgres 12 upgrade was performed on 2019-11-22, so the affected rows were
modified after the upgrade (although some of them were originally inserted
before the upgrade and only modified/duplicated afterwards).

> > This database runs inside Docker, with the data directory bind-mounted to a 
> > reflink-enabled XFS filesystem. The VM is running Debian's 4.19.16-1~bpo9+1 
> > kernel inside an AWS EC2 instance. We have Debezium stream data from this 
> > database via pgoutput.
> 
> That seems suspicious, since reflink support for XFS is rather immature.

Good point. Looking at kernel commits since 4.19.16, it appears that later
kernel versions include a few bug fixes that address XFS corruption issues.
Regardless of whether filesystem bugs are responsible for this corruption, I
plan to upgrade to a newer kernel.

> How did you invoke pg_upgrade? Did you use the --link (hard link) option?

Yes, we first created a backup using "cp -a --reflink=always", ran initdb on 
the new directory, and then upgraded using "pg_upgrade -b ... -B ... -d ... -D 
-k".

Alex



Re: Corruption with duplicate primary key

2019-12-11 Thread Alex Adriaanse
On Thu., December 5, 2019 at 5:45 PM, Tomas Vondra wrote:
> At first I thought maybe this might be due to collations
> changing and breaking the index silently. What collation are you using?

We're using en_US.utf8. We did not make any collation changes to my knowledge.

> 1) When you do the queries, do they use index scan or sequential scan?
> Perhaps it does sequential scan, and if you force index scan (e.g. by
> rewriting the query) it'll only find one of those rows.

By default it used an index scan. When I re-ran the query today (and confirmed
that it used an index-only scan) I did not see any duplicates. If I force a
sequential scan using "SET enable_index[only]scan = false", the duplicates
reappear.
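
Concretely, forcing the sequential scan amounts to something like the following
before re-running the duplicate-key SELECT above (a sketch):

SET enable_indexscan = false;
SET enable_indexonlyscan = false;
-- with index scans disabled the planner falls back to a sequential scan over
-- loader.sync, and the duplicate rows show up again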

However, using a backup from a week ago I see duplicates both in the query
that uses an index-only scan and in the query that uses a sequential scan. So
somehow over the past week the index changed such that it no longer returns
the duplicates.

> 2) Can you check in backups if this data corruption was present in the
> PG10 cluster, before running pg_upgrade? 

Sure. I just checked and did not see any corruption in the PG10 pre-upgrade 
backup. I also re-upgraded that PG10 backup to PG12, and right after the 
upgrade I did not see any corruption either. I checked using both index scans 
and sequential scans.

Alex



Re: Corruption with duplicate primary key

2019-12-11 Thread Alex Adriaanse
On Mon, December 9, 2019 at 11:05 AM Finnerty, Jim wrote:
> If you have BEFORE triggers, and a BEFORE trigger signaled failure with 
> RETURN NULL, then this is one known (and documented) issue that I think could 
> cause the behavior you're reporting:
>     
> https://www.postgresql-archive.org/BEFORE-triggers-that-return-NULL-can-circumvent-referential-integrity-tt6056390.html#none
> 
> It's hard to say if this is the cause or not, but if you have any BEFORE 
> triggers that RETURN NULL, you might want to review the documentation very 
> carefully.

We do have a BEFORE INSERT trigger, but it should never return NULL. This 
trigger INSERTs into a different table using an ON CONFLICT DO NOTHING clause 
and then does a RETURN NEW.
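
In outline the trigger looks something like the following (a simplified sketch;
the table, column, and function names here are placeholders rather than the
real ones):

CREATE FUNCTION before_insert_hook() RETURNS trigger AS $$
BEGIN
    -- insert into a different table, silently skipping rows that already exist
    INSERT INTO side_table (key, value)
    VALUES (NEW.key, NEW.value)
    ON CONFLICT DO NOTHING;
    RETURN NEW;  -- always NEW, never NULL, so the triggering INSERT proceeds
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER before_insert_trg
    BEFORE INSERT ON copy_target
    FOR EACH ROW EXECUTE FUNCTION before_insert_hook();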

Alex



Re: Corruption with duplicate primary key

2020-01-15 Thread Alex Adriaanse
On Thu., December 12, 2019 at 5:25 PM, Tomas Vondra wrote:
>On Wed, Dec 11, 2019 at 11:46:40PM +0000, Alex Adriaanse wrote:
>>On Thu., December 5, 2019 at 5:45 PM, Tomas Vondra wrote:
>>> At first I thought maybe this might be due to collations changing and
>>> breaking the index silently. What collation are you using?
>>
>>We're using en_US.utf8. We did not make any collation changes to my
>>knowledge.
>
>Well, the idea was more that glibc got updated and the collations
>changed because of that (without PostgreSQL having a chance to even
>notice that).

Closing the loop on this: I've investigated this some more, and it turns out
this is exactly what happened. As you suspected, the issue had nothing to do
with pg_upgrade or PG12, but rather with the glibc upgrade that came with
Debian Buster. The postgres:10 and postgres:11 images are based on Debian
Stretch, whereas postgres:12 is based on Buster.

When I kept the database on an older version of Postgres (10 or 11) but 
switched from the older Docker image to the postgres:12 or debian:buster(-slim) 
image, manually installing older Postgres packages inside those images, I saw 
index corruption there too.

Thanks for the input!

Alex