We have primary and hot standby databases running Postgres 11.3 inside Docker, 
with their data directories bind-mounted to a reflink-enabled XFS filesystem. 
The VMs are AWS EC2 instances running Debian's 4.19.16-1~bpo9+1 kernel.
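
For reference, each node is set up roughly like this; the device, mount point, 
and container name below are placeholders rather than our actual values:

# XFS filesystem created with reflink support
mkfs.xfs -m reflink=1 /dev/nvme1n1
mount /dev/nvme1n1 /srv/pgdata

# Postgres 11.3 in Docker, data directory bind-mounted from that filesystem
docker run -d --name pg-primary \
    -v /srv/pgdata/primary:/var/lib/postgresql/data \
    postgres:11.3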

A few months ago I saw TOAST corruption in a ~44GB table in one of the standby 
databases, so I wiped that standby and rebuilt it using pg_basebackup, which 
eliminated the corruption. This week corruption has appeared again in the same 
table in one of the standby databases. The other standby database shows no 
corruption.
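
For context, the rebuild was done roughly as follows; the host, user, and 
target directory here are placeholders rather than our exact values:

# run on the affected standby after stopping Postgres and clearing out
# its old data directory
pg_basebackup -h primary.internal -U replication \
    -D /var/lib/postgresql/data -X stream -c fast -R -P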

The corrupted table has four columns of types integer, text, text, and jsonb. 
The corruption happens inside the jsonb column.
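
The pg_toast_16103925 relation referenced below is this table's TOAST 
relation; it can be identified with a query along these lines (table name 
elided here as elsewhere in this message):

SELECT t.relname AS toast_relation
FROM pg_class c
JOIN pg_class t ON t.oid = c.reltoastrelid
WHERE c.oid = '<table>'::regclass;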

The corruption manifests itself as follows in the affected standby database:

SELECT length(json::text) FROM <table> WHERE identity = '...';
ERROR:  missing chunk number 0 for toast value 64265646 in pg_toast_16103925

SELECT ctid, chunk_id, chunk_seq, md5(chunk_data) FROM 
pg_toast.pg_toast_16103925 WHERE chunk_id = 64265646;
 ctid | chunk_id | chunk_seq | md5
------+----------+-----------+-----
(0 rows)

SELECT count(1) FROM pg_toast.pg_toast_16103925 WHERE chunk_id = 64265646;
 count
-------
     2
(1 row)

Looking up the TOAST block that is supposed to contain this value, you can see 
that the expected TOAST tuples are missing:

SELECT ctid, chunk_id, chunk_seq, md5(chunk_data) FROM 
pg_toast.pg_toast_16103925 WHERE ctid IN ('(1793121,1)', '(1793121,2)', 
'(1793121,3)', '(1793121,4)', '(1793121,5)', '(1793121,6)', '(1793121,7)');
    ctid     | chunk_id | chunk_seq |               md5
-------------+----------+-----------+----------------------------------
 (1793121,3) | 41259162 |         0 | 1bff36f306bac135cce9da44dd6d6bbb
 (1793121,4) | 41259162 |         1 | b754d688c5c847c7bc519e65741ffef1
 (1793121,5) | 41259163 |         0 | 10dfa4f5b3e32188f0b4b28c9be76abe
 (1793121,6) | 41259163 |         1 | 7dceb98b2c2f4ac3c72245c58c85323f
(4 rows)

For comparison, here are the same queries against the primary database:

SELECT length(json::text) FROM <table> WHERE identity = '...';
 length
--------
   7817
(1 row)

SELECT ctid, chunk_id, chunk_seq, md5(chunk_data) FROM 
pg_toast.pg_toast_16103925 WHERE chunk_id = 64265646;
    ctid     | chunk_id | chunk_seq |               md5
-------------+----------+-----------+----------------------------------
 (1793121,1) | 64265646 |         0 | a9a2642e8408fc178fb809b86c430997
 (1793121,2) | 64265646 |         1 | 371bc2628ac5bfc8b37d32f93d08fefe
(2 rows)

SELECT count(1) FROM pg_toast.pg_toast_16103925 WHERE chunk_id = 64265646;
 count
-------
     2
(1 row)

SELECT ctid, chunk_id, chunk_seq, md5(chunk_data) FROM 
pg_toast.pg_toast_16103925 WHERE ctid IN ('(1793121,1)', '(1793121,2)', 
'(1793121,3)', '(1793121,4)', '(1793121,5)', '(1793121,6)', '(1793121,7)');
    ctid     | chunk_id | chunk_seq |               md5
-------------+----------+-----------+----------------------------------
 (1793121,1) | 64265646 |         0 | a9a2642e8408fc178fb809b86c430997
 (1793121,2) | 64265646 |         1 | 371bc2628ac5bfc8b37d32f93d08fefe
 (1793121,3) | 41259162 |         0 | 1bff36f306bac135cce9da44dd6d6bbb
 (1793121,4) | 41259162 |         1 | b754d688c5c847c7bc519e65741ffef1
 (1793121,5) | 41259163 |         0 | 10dfa4f5b3e32188f0b4b28c9be76abe
 (1793121,6) | 41259163 |         1 | 7dceb98b2c2f4ac3c72245c58c85323f
(6 rows)

Looking at the data file for the TOAST relation, the header data structures in 
the relevant block look fine to me, which makes me think this is not caused by 
filesystem corruption (unless a write silently failed). The second half of that 
block is identical between the primary and the corrupted standby, but the first 
half of the standby's block is missing data.
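
For reference, the dd parameters below follow from the ctid block number, 
assuming the default 8kB block size and 1GB segment size:

1 GB segment / 8 kB block           = 131072 blocks per segment file
block 1793121 = 13 * 131072 + 89185 -> segment file .13, skip=89185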

Standby (corrupted):

# dd if=data/base/18034/16103928.13 bs=8192 skip=89185 count=1 status=none | 
hexdump -C | head -8
00000000  a3 0e 00 00 48 46 88 0e  00 00 05 00 30 00 58 0f  |....HF......0.X.|
00000010  00 20 04 20 00 00 00 00  00 00 00 00 00 00 00 00  |. . ............|
00000020  10 98 e0 0f 98 97 e8 00  a8 8f e0 0f 58 8f 96 00  |............X...|
00000030  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000f50  00 00 00 00 00 00 00 00  32 b0 0a 01 00 00 00 00  |........2.......|
00000f60  00 00 00 00 1b 00 61 5c  06 00 03 00 02 09 18 00  |......a\........|
00000f70  9b 90 75 02 01 00 00 00  ac 00 00 00 83 9f 64 00  |..u...........d.|

Primary:

# dd if=data/base/18034/16103928.13 bs=8192 skip=89185 count=1 status=none | 
hexdump -C | head -8
00000000  bd 0e 00 00 08 ad 32 b7  00 00 05 00 30 00 90 04  |......2.....0...|
00000010  00 20 04 20 00 00 00 00  68 87 e0 0f 90 84 a8 05  |. . ....h.......|
00000020  10 98 e0 0f 98 97 e8 00  a8 8f e0 0f 58 8f 96 00  |............X...|
00000030  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000490  a6 07 7e 02 00 00 00 00  00 00 00 00 1b 00 61 5c  |..~...........a\|
000004a0  02 00 03 00 02 09 18 00  ae 9d d4 03 01 00 00 00  |................|
000004b0  d0 0a 00 00 23 25 10 07  88 02 13 0f 2c 04 78 01  |....#%......,.x.|
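
As a cross-check, if the pageinspect extension can be installed on these 
databases, the same block could also be examined from SQL. This is just a 
sketch of that approach, not something I have run here:

CREATE EXTENSION IF NOT EXISTS pageinspect;

-- page header of the block in question
SELECT * FROM page_header(get_raw_page('pg_toast.pg_toast_16103925', 1793121));

-- line pointers and tuple headers within that block
SELECT lp, lp_off, lp_len, t_xmin, t_xmax, t_ctid
FROM heap_page_items(get_raw_page('pg_toast.pg_toast_16103925', 1793121));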

Based on the above observations it seems to me that occasionally some changes 
either aren't being replicated to the standby database or aren't being 
persisted by it. In the past I've seen TCP packets get mangled or dropped 
between our EC2 instances, leading to sudden disconnects. The standby connects 
to the primary using SSL (sslmode=require sslcompression=1), so I would expect 
SSL to catch any network-level corruption, causing the connection to fail and 
reconnect. Outside of any SSL disconnects (which don't happen often), this 
database is stopped and restarted twice a week so we can clone it (using 
cp -a --reflink=always).
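
For completeness, the twice-weekly clone is roughly the following; the paths 
and container name are placeholders:

# stop the standby so its data directory is quiescent
docker stop pg-standby

# reflink copy on XFS: fast, and shares extents with the original
cp -a --reflink=always /srv/pgdata/standby /srv/pgdata/clone-$(date +%Y%m%d)

docker start pg-standby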

Any ideas on what might be causing this?

Thanks,

Alex
