TOAST corruption in standby database
We have primary and hot standby databases running Postgres 11.3 inside Docker, with their data directories bind-mounted to a reflink-enabled XFS filesystem. The VM is running Debian's 4.19.16-1~bpo9+1 kernel inside an AWS EC2 instance.

A few months ago I saw TOAST corruption in a ~44GB table in one of the standby databases, so I wiped that standby and rebuilt it using pg_basebackup, which eliminated the corruption. This week corruption has appeared again in the same table in one of the standby databases; the other standby shows no corruption. The affected table has four columns of types integer, text, text, and jsonb, and the corruption is in the jsonb column.

On the standby the corruption manifests itself as follows:

SELECT length(json::text) FROM WHERE identity = '...';
ERROR:  missing chunk number 0 for toast value 64265646 in pg_toast_16103925

SELECT ctid, chunk_id, chunk_seq, md5(chunk_data)
FROM pg_toast.pg_toast_16103925
WHERE chunk_id = 64265646;
 ctid | chunk_id | chunk_seq | md5
------+----------+-----------+-----
(0 rows)

SELECT count(1) FROM pg_toast.pg_toast_16103925 WHERE chunk_id = 64265646;
 count
-------
     2
(1 row)

Looking up the TOAST block that is supposed to contain this value, you can see that the TOAST tuples are missing:

SELECT ctid, chunk_id, chunk_seq, md5(chunk_data)
FROM pg_toast.pg_toast_16103925
WHERE ctid IN ('(1793121,1)', '(1793121,2)', '(1793121,3)', '(1793121,4)',
               '(1793121,5)', '(1793121,6)', '(1793121,7)');
    ctid     | chunk_id | chunk_seq |               md5
-------------+----------+-----------+----------------------------------
 (1793121,3) | 41259162 |         0 | 1bff36f306bac135cce9da44dd6d6bbb
 (1793121,4) | 41259162 |         1 | b754d688c5c847c7bc519e65741ffef1
 (1793121,5) | 41259163 |         0 | 10dfa4f5b3e32188f0b4b28c9be76abe
 (1793121,6) | 41259163 |         1 | 7dceb98b2c2f4ac3c72245c58c85323f
(4 rows)

For comparison, here are the same queries against the primary database:

SELECT length(json::text) FROM WHERE identity = '...';
 length
--------
   7817
(1 row)

SELECT ctid, chunk_id, chunk_seq, md5(chunk_data)
FROM pg_toast.pg_toast_16103925
WHERE chunk_id = 64265646;
    ctid     | chunk_id | chunk_seq |               md5
-------------+----------+-----------+----------------------------------
 (1793121,1) | 64265646 |         0 | a9a2642e8408fc178fb809b86c430997
 (1793121,2) | 64265646 |         1 | 371bc2628ac5bfc8b37d32f93d08fefe
(2 rows)

SELECT count(1) FROM pg_toast.pg_toast_16103925 WHERE chunk_id = 64265646;
 count
-------
     2
(1 row)

SELECT ctid, chunk_id, chunk_seq, md5(chunk_data)
FROM pg_toast.pg_toast_16103925
WHERE ctid IN ('(1793121,1)', '(1793121,2)', '(1793121,3)', '(1793121,4)',
               '(1793121,5)', '(1793121,6)', '(1793121,7)');
    ctid     | chunk_id | chunk_seq |               md5
-------------+----------+-----------+----------------------------------
 (1793121,1) | 64265646 |         0 | a9a2642e8408fc178fb809b86c430997
 (1793121,2) | 64265646 |         1 | 371bc2628ac5bfc8b37d32f93d08fefe
 (1793121,3) | 41259162 |         0 | 1bff36f306bac135cce9da44dd6d6bbb
 (1793121,4) | 41259162 |         1 | b754d688c5c847c7bc519e65741ffef1
 (1793121,5) | 41259163 |         0 | 10dfa4f5b3e32188f0b4b28c9be76abe
 (1793121,6) | 41259163 |         1 | 7dceb98b2c2f4ac3c72245c58c85323f
(6 rows)

Looking at the data file for the TOAST relation, the header data structures in the relevant block look fine to me, which makes me think this is not caused by filesystem corruption (unless a write silently failed). The second half of that block is identical between the primary and the corrupted standby, but in the first half the corrupted standby is missing data.

Standby (corrupted):

# dd if=data/base/18034/16103928.13 bs=8192 skip=89185 count=1 status=none | hexdump -C | head -8
00000000  a3 0e 00 00 48 46 88 0e  00 00 05 00 30 00 58 0f  |....HF......0.X.|
00000010  00 20 04 20 00 00 00 00  00 00 00 00 00 00 00 00  |. . ............|
00000020  10 98 e0 0f 98 97 e8 00  a8 8f e0 0f 58 8f 96 00  |............X...|
00000030  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000f50  00 00 00 00 00 00 00 00  32 b0 0a 01 00 00 00 00  |........2.......|
00000f60  00 00 00 00 1b 00 61 5c  06 00 03 00 02 09 18 00  |......a\........|
00000f70  9b 90 75 02 01 00 00 00  ac 00 00 00 83 9f 64 00  |..u...........d.|

Primary:

# dd if=data/base/18034/16103928.13 bs=8192 skip=89185 count=1 status=none | hexdump -C | head -8
00000000  bd 0e 00 00 08 ad 32 b7  00 00 05 00 30 00 90 04  |......2.....0...|
00000010  00 20 04 20 00 00 00 00  68 87 e0 0f 90 84 a8 05  |. . ....h.......|
00000020  10 98 e0 0f 98 97 e8 00  a8 8f e0 0f 58 8f 96 00  |............X...|
00000030  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000490  a6 07 7e 02 00 00 00 00  00 00 00 00 1b 00 61 5c  |..~...........a\|
000004a0  02 00 03 00 02 09 18 00  ae 9d d4 03 01 00 00 00  |................|
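A similar error can be hunted down across the whole table with a brute-force detoast pass. The following is only a rough sketch (the table name my_table stands in for the redacted name above, and the identity/json column names are assumed from the queries shown): it forces each jsonb value to be fully detoasted and reports the rows that raise "missing chunk number" errors:

DO $$
DECLARE
    r RECORD;
    len integer;
BEGIN
    FOR r IN SELECT identity FROM my_table LOOP
        BEGIN
            -- Detoasting the value is what trips over a broken TOAST pointer.
            SELECT length(json::text) INTO len
            FROM my_table WHERE identity = r.identity;
        EXCEPTION WHEN OTHERS THEN
            RAISE NOTICE 'broken row: identity=%, error=%', r.identity, SQLERRM;
        END;
    END LOOP;
END;
$$;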
Corruption with duplicate primary key
We have a Postgres 10 database that we recently upgraded to Postgres 12 using pg_upgrade. We recently discovered that there are rows in one of the tables that have duplicate primary keys:

record_loader=# \d loader.sync
                              Table "loader.sync"
      Column       |           Type           | Collation | Nullable | Default
-------------------+--------------------------+-----------+----------+---------
 source            | text                     |           | not null |
 natural_key       | text                     |           | not null |
 payload           | jsonb                    |           |          |
 dispatched        | timestamp with time zone |           | not null | now()
 initial_load_id   | text                     |           |          |
 deleted_load_id   | text                     |           |          |
 created_timestamp | timestamp with time zone |           |          | now()
 updated_timestamp | timestamp with time zone |           |          | now()
 deleted_timestamp | timestamp with time zone |           |          |
Indexes:
    "sync_pkey" PRIMARY KEY, btree (source, natural_key)
Publications:
    "debezium"

This table is modified via triggers that fire when a COPY command inserts many rows into another table. Here are two example duplicate rows:

# SELECT xmin, xmax, cmin, cmax, source, md5(natural_key) AS natural_key_hash,
         dispatched, created_timestamp, updated_timestamp, deleted_timestamp
  FROM loader.sync
  WHERE (source, natural_key) = ('ok_lease', '...')
  ORDER BY xmin::text::int, cmin::text::int;
-[ RECORD 1 ]-----+---------------------------------
xmin              | 116649
xmax              | 0
cmin              | 5304404
cmax              | 5304404
source            | ok_lease
natural_key_hash  | de3e9a567b90025c3399c4c63c823fe9
dispatched        | 2019-11-24 05:09:36.099686+00
created_timestamp | 2019-11-24 05:09:36.099686+00
updated_timestamp | 2019-11-24 05:09:36.099686+00
deleted_timestamp |
-[ RECORD 2 ]-----+---------------------------------
xmin              | 116649
xmax              | 118583
cmin              | 5312208
cmax              | 5312208
source            | ok_lease
natural_key_hash  | de3e9a567b90025c3399c4c63c823fe9
dispatched        | 2019-11-10 05:09:24.214964+00
created_timestamp | 2019-05-17 21:24:19.558219+00
updated_timestamp | 2019-11-24 05:09:36.099686+00
deleted_timestamp | 2019-11-24 05:09:36.099686+00

It appears that the second row was in place originally, then got updated by a trigger (and even deleted later on, although it doesn't appear that the deleting transaction was committed), and then the first row was inserted within the same transaction that updated the second row.

Another example:

-[ RECORD 1 ]-----+---------------------------------
xmin              | 116649
xmax              | 0
cmin              | 5304403
cmax              | 5304403
source            | ok_lease
natural_key_hash  | 1c8031348701a32cb5fee26839d6b0b4
dispatched        | 2019-11-10 05:09:24.214964+00
created_timestamp | 2019-05-31 06:00:33.765547+00
updated_timestamp | 2019-11-24 05:09:36.099686+00
deleted_timestamp | 2019-11-24 05:09:36.099686+00
-[ RECORD 2 ]-----+---------------------------------
xmin              | 116649
xmax              | 0
cmin              | 5304404
cmax              | 5304404
source            | ok_lease
natural_key_hash  | 1c8031348701a32cb5fee26839d6b0b4
dispatched        | 2019-11-24 05:09:36.099686+00
created_timestamp | 2019-11-24 05:09:36.099686+00
updated_timestamp | 2019-11-24 05:09:36.099686+00
deleted_timestamp |

Both examples have in common that the two duplicate rows were touched within the same transaction.

This database runs inside Docker, with the data directory bind-mounted to a reflink-enabled XFS filesystem. The VM is running Debian's 4.19.16-1~bpo9+1 kernel inside an AWS EC2 instance. We have Debezium streaming data from this database via pgoutput.

Recreating the primary key confirms that the constraint doesn't (or at least shouldn't) permit these duplicate rows:

record_loader=# BEGIN;
BEGIN
record_loader=# ALTER TABLE loader.sync DROP CONSTRAINT sync_pkey;
ALTER TABLE
record_loader=# ALTER TABLE loader.sync ADD CONSTRAINT sync_pkey PRIMARY KEY (source, natural_key);
ERROR:  could not create unique index "sync_pkey"
DETAIL:  Key (source, natural_key)=(ok_lease, ...) is duplicated.
CONTEXT:  parallel worker

Any ideas on what might cause this behavior?

Thanks,
Alex
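For reference, the full set of violating keys (not just the ones found by chance) can be enumerated with a grouping query along these lines. This is only a sketch against the loader.sync definition above; ctid and xmin are included purely to help tell the duplicate copies apart:

SELECT source,
       md5(natural_key) AS natural_key_hash,
       count(*) AS copies,
       array_agg(ctid::text) AS tids,
       array_agg(xmin::text) AS xmins
FROM loader.sync
GROUP BY source, natural_key
HAVING count(*) > 1
ORDER BY copies DESC;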
Re: Corruption with duplicate primary key
On Thu, December 5, 2019 at 5:34 PM Peter Geoghegan wrote:
> > We have a Postgres 10 database that we recently upgraded to Postgres 12
> > using pg_upgrade. We recently discovered that there are rows in one of the
> > tables that have duplicate primary keys:
>
> What's the timeline here? In other words, does it look like these rows
> were updated and/or deleted before, around the same time as, or after
> the upgrade?

The Postgres 12 upgrade was performed on 2019-11-22, so the affected rows were modified after this upgrade (although some of the rows were originally inserted before then, prior to being modified/duplicated).

> > This database runs inside Docker, with the data directory bind-mounted to a
> > reflink-enabled XFS filesystem. The VM is running Debian's 4.19.16-1~bpo9+1
> > kernel inside an AWS EC2 instance. We have Debezium streaming data from this
> > database via pgoutput.
>
> That seems suspicious, since reflink support for XFS is rather immature.

Good point. Looking at the kernel commits since 4.19.16, it appears that later kernel versions contain fixes for several XFS corruption issues. Regardless of whether filesystem bugs are responsible for this corruption, I'll plan on upgrading to a newer kernel.

> How did you invoke pg_upgrade? Did you use the --link (hard link) option?

Yes. We first created a backup using "cp -a --reflink=always", ran initdb on the new directory, and then upgraded using "pg_upgrade -b ... -B ... -d ... -D -k".

Alex
Re: Corruption with duplicate primary key
On Thu., December 5, 2019 at 5:45 PM, Tomas Vondra wrote:
> At first I thought maybe this might be due to collations
> changing and breaking the index silently. What collation are you using?

We're using en_US.utf8. We did not make any collation changes to my knowledge.

> 1) When you do the queries, do they use index scan or sequential scan?
> Perhaps it does sequential scan, and if you force index scan (e.g. by
> rewriting the query) it'll only find one of those rows.

By default it used an index scan. When I re-ran the query today (and confirmed that it used an index-only scan) I did not see any duplicates. If I force a sequential scan using "SET enable_index[only]scan = false", the duplicates reappear.

However, using a backup from a week ago I see duplicates both in the query that uses an index-only scan and in the one that uses a sequential scan. So somehow over the past week the index changed in a way that no longer returns the duplicates.

> 2) Can you check in backups if this data corruption was present in the
> PG10 cluster, before running pg_upgrade?

Sure. I just checked and did not see any corruption in the pre-upgrade PG10 backup. I also re-upgraded that PG10 backup to PG12, and right after the upgrade I did not see any corruption either. I checked using both index scans and sequential scans.

Alex
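For anyone reproducing this kind of index/heap disagreement: the comparison described above boils down to running the same query with and without index scans, and the contrib amcheck extension can verify the index against the heap directly. A rough sketch using the objects from this thread (the '...' value stands for the redacted key):

-- Same query twice: once via the index, once via a forced sequential scan.
BEGIN;
SET LOCAL enable_indexscan = off;
SET LOCAL enable_indexonlyscan = off;
SET LOCAL enable_bitmapscan = off;
SELECT count(*) FROM loader.sync
WHERE (source, natural_key) = ('ok_lease', '...');
ROLLBACK;

-- Check the btree structure, and that every heap tuple is present in the index.
CREATE EXTENSION IF NOT EXISTS amcheck;
SELECT bt_index_check('loader.sync_pkey'::regclass, true);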
Re: Corruption with duplicate primary key
On Mon, December 9, 2019 at 11:05 AM Finnerty, Jim wrote:
> If you have BEFORE triggers, and a BEFORE trigger signaled failure with
> RETURN NULL, then this is one known (and documented) issue that I think could
> cause the behavior you're reporting:
>
> https://www.postgresql-archive.org/BEFORE-triggers-that-return-NULL-can-circumvent-referential-integrity-tt6056390.html#none
>
> It's hard to say if this is the cause or not, but if you have any BEFORE
> triggers that RETURN NULL, you might want to review the documentation very
> carefully.

We do have a BEFORE INSERT trigger, but it should never return NULL. This trigger INSERTs into a different table using an ON CONFLICT DO NOTHING clause and then does a RETURN NEW.

Alex
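To make the shape of that trigger concrete, here is a minimal hypothetical sketch of the pattern described above. None of these names come from the actual schema; the point is only that the function ends with RETURN NEW rather than RETURN NULL:

CREATE FUNCTION staging_before_insert() RETURNS trigger AS $$
BEGIN
    -- Mirror the incoming key into another table, ignoring rows already there.
    INSERT INTO other_table (source, natural_key)
    VALUES (NEW.source, NEW.natural_key)
    ON CONFLICT DO NOTHING;

    -- Returning NEW lets the row being COPYed/INSERTed proceed as usual.
    -- Returning NULL here would silently drop the row, which is the pitfall
    -- described in the thread linked above.
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER staging_before_insert
    BEFORE INSERT ON staging_table
    FOR EACH ROW EXECUTE FUNCTION staging_before_insert();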
Re: Corruption with duplicate primary key
On Thu., December 12, 2019 at 5:25 PM, Tomas Vondra wrote:
>On Wed, Dec 11, 2019 at 11:46:40PM +0000, Alex Adriaanse wrote:
>>On Thu., December 5, 2019 at 5:45 PM, Tomas Vondra wrote:
>>> At first I thought maybe this might be due to collations changing and
>>> breaking the index silently. What collation are you using?
>>
>>We're using en_US.utf8. We did not make any collation changes to my
>>knowledge.
>
>Well, the idea was more that glibc got updated and the collations
>changed because of that (without PostgreSQL having a chance to even
>notice that).

Closing the loop on this: I've investigated further, and that is exactly what happened. As you suspected, the issue had nothing to do with pg_upgrade or PG12, but rather with the glibc upgrade that came with Debian Buster. The postgres:10 and postgres:11 Docker images are based on Debian Stretch, whereas postgres:12 is based on Buster. When I kept the database on an older version of Postgres (10 or 11) but switched from the older Docker image to a postgres:12 or debian:buster(-slim) image, manually installing the older Postgres packages inside those images, I saw index corruption there too.

Thanks for the input!

Alex
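For anyone cleaning up after the same kind of glibc collation change: once the cluster is running against the new libc, indexes on collatable columns need to be rebuilt. A rough sketch (sync_pkey is the index from this thread; the catalog query is just one approximate way to find other indexes that depend on libc sort order):

-- Rebuild the index that is known to be inconsistent.
REINDEX INDEX loader.sync_pkey;

-- Approximate list of other indexes covering text/varchar columns, which
-- are the ones sensitive to a libc collation change.
SELECT DISTINCT i.indexrelid::regclass AS index_name
FROM pg_index i
JOIN pg_attribute a
  ON a.attrelid = i.indrelid
 AND a.attnum = ANY (i.indkey)
WHERE a.atttypid IN ('text'::regtype, 'varchar'::regtype);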