Hi hackers,

We recently ran into an issue where the visibility map of a relation was 
corrupt, running Postgres 12.4. The error we'd get when running a SELECT * from 
this table is:

could not access status of transaction 3704450152
DETAIL:  Could not open file "pg_xact/0DCC": No such file or directory.

On the lists I could find several similar reports, but corruption like this 
could obviously have a very wide range of root causes.. see [1] [2] [3] for 
example - not all of them have their root cause known.

This particular case was similar to reported cases above, but also has some 
differences.

The following query returns ~21.000 rows, which indicates something 
inconsistent between the visibility map and the pd_all_visible flag on the page:

select * from pg_check_frozen('tbl');

Looking at one of the affected pages with pageinspect:

=# SELECT 
lp,lp_off,lp_flags,lp_len,t_xmin,t_xmax,t_field3,t_ctid,t_infomask2,t_infomask,t_hoff,t_oid
 FROM heap_page_items(get_raw_page('tbl', 726127));
┌────┬────────┬──────────┬────────┬────────────┬────────────┬──────────┬────────────┬─────────────┬────────────┬────────┬───────┐
│ lp │ lp_off │ lp_flags │ lp_len │   t_xmin   │   t_xmax   │ t_field3 │   
t_ctid   │ t_infomask2 │ t_infomask │ t_hoff │ t_oid │
├────┼────────┼──────────┼────────┼────────────┼────────────┼──────────┼────────────┼─────────────┼────────────┼────────┼───────┤
│  1 │   6328 │        1 │   1864 │ 3704450155 │ 3704450155 │        1 │ 
(726127,1) │         249 │       8339 │     56 │     ∅ │
│  2 │   4464 │        1 │   1864 │ 3704450155 │ 3704450155 │        1 │ 
(726127,2) │         249 │       8339 │     56 │     ∅ │
│  3 │   2600 │        1 │   1864 │ 3704450155 │ 3704450155 │        1 │ 
(726127,3) │         249 │       8339 │     56 │     ∅ │
│  4 │    680 │        1 │   1920 │ 3704450155 │ 3704450155 │        1 │ 
(726127,4) │         249 │       8339 │     56 │     ∅ │
└────┴────────┴──────────┴────────┴────────────┴────────────┴──────────┴────────────┴─────────────┴────────────┴────────┴───────┘

t_infomask shows that HEAP_XMIN_COMMITTED and HEAP_XMIN_INVALID bits are both 
unset.
This pg_visibility() call shows the inconsistency between VM and page, with 
PD_ALL_VISIBLE=false

=# select * from pg_visibility('tbl', 726127);
┌─────────────┬────────────┬────────────────┐
│ all_visible │ all_frozen │ pd_all_visible │
├─────────────┼────────────┼────────────────┤
│ t           │ t          │ f              │
└─────────────┴────────────┴────────────────┘

Looking at other pages show the same information.
What's interesting is that out of the affected tuples returned by 
pg_check_frozen, over 99% belong to 1 transaction (3704450155 as seen above) 
and the remaining few are from one other transaction that occurred at roughly 
the same time.
I find it hard to believe that this is due to some random bit flipping, because 
many pages are affected, but the "incorrect" ones are in total only from two 
specific transactions which occurred at roughly the same time. There were also 
no server crashes or other known failures around the time of this transaction. 
I'm not ruling out any other kind of failure still, but I also cannot really 
explain how this could have happened.

The server has PG12.4 with full_page_writes=on, data_checksums=off. It's a 
large analytics database. The VM inconsistencies also occur on the streaming 
replicas.

I realize these cases are pretty rare and hard to debug, but I wanted to share 
the information I found so far here for reference. Maybe someone has an idea 
what occurred, or maybe someone in the future finds it useful when he finds 
something similar.

I have no idea how the inconsistency between VM and PD_ALL_VISIBLE started - 
from looking through the code I can't really find any way how this could occur. 
However, for it to lead to the problem described here, I believe there should 
be *no* SELECT that touches that particular page after the insert/update 
transaction and before the transaction log gets truncated. If the page is read 
before the transaction log gets truncated, then the hint bit 
HEAP_XMIN_COMMITTED will get set and future reads will succeed regardless of tx 
log truncation. One of the replica's had this happen to it: the affected pages 
are identical to the primary except that the HEAP_XMIN_COMMITTED flag is set 
(note that the VM inconsistency is still there on the replica though: 
PD_ALL_VISIBLE=false even though VM shows that all_frozen=all_visible=true). 
But I can query these rows on the replica without issues, because it doesn't 
check the tx log when it sees that HEAP_XMIN_COMMITTED is set.

-Floris

[1] https://postgrespro.com/list/thread-id/2422376
[2] https://postgrespro.com/list/thread-id/2501800
[3] https://postgrespro.com/list/thread-id/2321949

Reply via email to