pá 17. 5. 2024 v 22:05 odesílatel Pavel Stehule <pavel.steh...@gmail.com> napsal:
> > > pá 17. 5. 2024 v 21:50 odesílatel Andres Freund <and...@anarazel.de> > napsal: > >> Hi, >> >> On 2024-05-17 15:12:31 +0200, Pavel Stehule wrote: >> > after migration on PostgreSQL 16 I seen 3x times (about every week) >> broken >> > tables on replica nodes. The query fails with error >> >> Migrating from what version? >> > > I think 14, but it should be verified tomorrow > upgrade was from 15.2 > >> >> You're saying that the data is correctly accessible on primaries, but >> broken >> on standbys? Is there any difference in how the page looks like on the >> primary >> vs standby? >> > > I saved one page from master and standby. Can I send it privately? There > are some private data (although not too sensitive) > > >> >> >> > ERROR: could not access status of transaction 1442871302 >> > DETAIL: Could not open file "pg_xact/0560": No such file or directory >> > >> > verify_heapam reports >> > >> > ^[[Aprd=# select * from verify_heapam('account_login_history') where >> blkno >> > = 179036; >> > blkno | offnum | attnum | msg >> > >> > >> --------+--------+--------+------------------------------------------------------------------- >> > 179036 | 30 | | xmin 1393743382 precedes oldest valid >> > transaction ID 3:1687012112 >> >> So that's not just a narrow race... >> >> >> > master >> > >> > (2024-05-17 14:36:57) prd=# SELECT * FROM >> > page_header(get_raw_page('account_login_history', 179036)); >> > lsn │ checksum │ flags │ lower │ upper │ special │ pagesize │ >> > version │ prune_xid >> > >> ───────────────┼──────────┼───────┼───────┼───────┼─────────┼──────────┼─────────┼─────────── >> > A576/810F4CE0 │ 0 │ 4 │ 296 │ 296 │ 8192 │ 8192 │ >> > 4 │ 0 >> > (1 row) >> > >> > >> > replica >> > prd_aukro=# SELECT * FROM >> page_header(get_raw_page('account_login_history', >> > 179036)); >> > lsn | checksum | flags | lower | upper | special | pagesize | >> > version | prune_xid >> > >> ---------------+----------+-------+-------+-------+---------+----------+---------+----------- >> > A56C/63979DA0 | 0 | 0 | 296 | 296 | 8192 | 8192 | >> > 4 | 0 >> > (1 row) >> >> Is the replica behind the primary? Or did we somehow end up with diverging >> data? The page LSNs differ by about 40GB... >> >> Is there evidence of failed truncations of the relation in the log? From >> autovacuum? >> > > no I did not see these bugs, > >> >> Does the data in the readable versions of the tuples on that page actually >> look valid? Is it possibly duplicated data? >> > > looks well, I didn't see any strange content > >> >> >> I'm basically wondering whether it's possible that we errored out during >> truncation (e.g. due to a file permission issue or such). Due to some >> brokenness in RelationTruncate() that can lead to data divergence between >> primary and standby and to old tuples re-appearing on either. >> >> >> Another question: Do you use pg_repack or such? >> > > pg_repack was used 2 months before migration > > > >> >> Greetings, >> >> Andres Freund >> >