ERROR: cannot freeze committed xmax
Hi, Few weeks ago at one of the databases we started to observe the following error in Postgresql logs: ERROR: cannot freeze committed xmax 572 For the note it's Postgresql 10.14 running on RHEL: version -- PostgreSQL 10.14 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39), 64-bit (1 row) Apparently it happens during the autovacuum and leads to the failing autovacuum freeze progress. We have detected two dictionary tables where manual VACUUM FREEZE fails: pg_proc and pg_depend. Below you can see the page_inspect results for the appropriate blocks as well as the status of xid. Am I understand correctly that for some reason the transaction is marked as commited while on the page level it's rolled back so the inconsistency is detected? Could you advise how to fix it and make autovacuum work again? SELECT txid_status(572); txid_status - committed (1 row) SELECT ctid, xmin, xmax FROM pg_proc WHERE xmax=572; ctid | xmin | xmax -+--+-- (75,19) | 571 | 572 (1 row) SELECT ctid, xmin, xmax FROM pg_depend WHERE xmax=572; ctid | xmin | xmax -+--+-- (55,76) | 571 | 572 (55,77) | 571 | 572 (55,78) | 571 | 572 (55,79) | 571 | 572 (55,80) | 571 | 572 (5 rows) SELECT lp, t_ctid AS ctid, t_xmin AS xmin, t_xmax AS xmax, (t_infomask & 128)::boolean AS xmax_is_lock, (t_infomask & 1024)::boolean AS xmax_committed, (t_infomask & 2048)::boolean AS xmax_rolled_back, (t_infomask & 4096)::boolean AS xmax_multixact FROM heap_page_item_attrs( get_raw_page('pg_proc', 75), 'pg_proc' ); lp | ctid | xmin| xmax | xmax_is_lock | xmax_committed | xmax_rolled_back | xmax_multixact +-+---+--+--++--+ 1 | (75,1) | 564 |0 | f| f | t | f 2 | (75,2) | 564 |0 | f| f | t | f 3 | (75,3) | 564 |0 | f| f | t | f 4 | (75,4) | 564 |0 | f| f | t | f 5 | (75,5) | 564 |0 | f| f | t | f 6 | (75,6) | 564 |0 | f| f | t | f 7 | (75,7) | 564 |0 | f| f | t | f 8 | (75,8) | 564 |0 | f| f | t | f 9 | (75,9) | 564 |0 | f| f | t | f 10 | (75,10) | 566 |0 | f| f | t | f 11 | (75,11) | 566 |0 | f| f | t | f 12 | | | | || | 13 | | | | || | 14 | (75,14) | 212156121 |0 | f| f | t | f 15 | (75,15) | 212156121 |0 | f| f | t | f 16 | (75,16) | 212156121 |0 | f| f | t | f 17 | (75,17) | 212156121 |0 | f| f | t | f 18 | (75,18) | 212156121 |0 | f| f | t | f 19 | (75,21) | 571 | 572 | f| f | t | f 20 | (75,20) | 212156121 |0 | f| f | t | f 21 | (75,21) | 572 |0 | f| f | t | f 22 | (75,22) | 97113057 |0 | f| f | t | f 23 | (75,23) | 212156121 |0 | f| f | t | f 24 | (75,24) | 212156121 |0 | f| f | t | f 25 | (75,25) | 212156121 |0 | f| f | t | f 26 | (75,26) | 212156121 |0 | f| f | t | f 27 | (75,27) | 212156121 |0 | f| f | t | f (27 rows) SELECT lp, t_ctid AS ctid, t_xmin AS xmin, t_xmax AS xmax, (t_i
Re: ERROR: cannot freeze committed xmax
Hi Alvaro, glad to hear from you! This database is relatevely young, it was initdb'ed about a year ago or so and it was initially at 10.x. I don't know the exact minor version but the major version was 10 for sure. The problematic row is actually visible: SELECT COUNT(*) FROM pg_proc WHERE ctid = '(75,19)' count --- 1 (1 row) And it refers to a known custom stored function. Also I'm not sure if I can really skip freezing at this point because this table is already over autovacuum_freeze_max_age (correct me if I'm wrong): SELECT age(c.relfrozenxid), s.setting autovacuum_freeze_max_age FROM pg_class c, pg_settings s WHERE c.relname = 'pg_proc' AND s.name = 'autovacuum_freeze_max_age'; age| autovacuum_freeze_max_age ---+--- 213791108 | 2 (1 row) This is a production database which I can not restart at any point unfortunately, so I can't change the global settings atm. We can sacrifice that function and recreate it, but unfortunately when I execute DROP it fails with the error: ERROR: could not find tuple for rule 16396 I guess it's because of the broken pages at pg_depend. Do you think it is safe to execute DELETE on the corresponding records in pr_proc and pro_depend? Would it update the broken xmax? I feel like I'm dealing hear with some sort of data curruption. I recall that few months ago we already had issues with the same function and view based on it. At that point for some reason it just stopped working. So we had to DROP and RECREATE it. It fixed the issue, but apparently the data inconsistency stayed on disk and now we've got to the situation where AUTOVACUUM stopped working and it makes me really nervous. > On 15 Jul 2021, at 15:41, Alvaro Herrera wrote: > > One thing I forgot is that these XIDs are fairly old, perhaps dating > back to when this database was freshly initdb'd if there has been no XID > wraparound. In that case you were probably running a version much older > than 10.14 when they were written. Do you happen to know when did you > initdb this, with what version, when did you upgrade this to 10.14? > That may help search the commit log for bugfixes that might explain the > bug. I just remembered this one as my favorite candidate: > > Author: Alvaro Herrera > Branch: master Release: REL_11_BR [d2599ecfc] 2018-05-04 18:24:45 -0300 > Branch: REL_10_STABLE Release: REL_10_4 [e1d634758] 2018-05-04 18:23:58 -0300 > Branch: REL9_6_STABLE Release: REL9_6_9 [3a11485a5] 2018-05-04 18:23:30 -0300 > >Don't mark pages all-visible spuriously > >Dan Wood diagnosed a long-standing problem that pages containing tuples >that are locked by multixacts containing live lockers may spuriously end >up as candidates for getting their all-visible flag set. This has the >long-term effect that multixacts remain unfrozen; this may previously >pass undetected, but since commit XYZ it would be reported as > "ERROR: found multixact 134100944 from before relminmxid 192042633" >because when a later vacuum tries to freeze the page it detects that a >multixact that should have gotten frozen, wasn't. > >Dan proposed a (correct) patch that simply sets a variable to its >correct value, after a bogus initialization. But, per discussion, it >seems better coding to avoid the bogus initializations altogether, since >they could give rise to more bugs later. Therefore this fix rewrites >the logic a little bit to avoid depending on the bogus initializations. > >This bug was part of a family introduced in 9.6 by commit a892234f830e; >later, commit 38e9f90a227d fixed most of them, but this one was >unnoticed. > >Authors: Dan Wood, Pavan Deolasee, Álvaro Herrera >Reviewed-by: Masahiko Sawada, Pavan Deolasee, Álvaro Herrera >Discussion: > https://postgr.es/m/84ebac55-f06d-4fbe-a3f3-8bda093ce...@amazon.com > > > -- > Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ > "El número de instalaciones de UNIX se ha elevado a 10, > y se espera que este número aumente" (UPM, 1972)