ERROR: cannot freeze committed xmax

2021-07-14 Thread Sasha Aliashkevich


Hi,

Few weeks ago at one of the databases we started to observe the following error 
in Postgresql logs:

ERROR: cannot freeze committed xmax 572

For the note it's Postgresql 10.14 running on RHEL:

 version
  

--
 PostgreSQL 10.14 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 
20150623 (Red Hat 4.8.5-39), 64-bit
(1 row)


Apparently it happens during the autovacuum and leads to the failing autovacuum 
freeze progress.
We have detected two dictionary tables where manual VACUUM FREEZE fails: 
pg_proc and pg_depend.
Below you can see the page_inspect results for the appropriate blocks as well 
as the status of xid.
Am I understand correctly that for some reason the transaction is marked as 
commited while on the page level it's rolled back so the inconsistency is 
detected?
Could you advise how to fix it and make autovacuum work again?


SELECT txid_status(572);

 txid_status 
-
 committed
(1 row)


SELECT ctid, xmin, xmax FROM pg_proc WHERE xmax=572;

  ctid   | xmin | xmax 
-+--+--
 (75,19) |  571 |  572
(1 row)

SELECT ctid, xmin, xmax FROM pg_depend WHERE xmax=572;

  ctid   | xmin | xmax 
-+--+--
 (55,76) |  571 |  572
 (55,77) |  571 |  572
 (55,78) |  571 |  572
 (55,79) |  571 |  572
 (55,80) |  571 |  572
(5 rows)


SELECT lp, 
   t_ctid AS ctid,
   t_xmin AS xmin,
   t_xmax AS xmax,
   (t_infomask & 128)::boolean AS xmax_is_lock,
   (t_infomask & 1024)::boolean AS xmax_committed,
   (t_infomask & 2048)::boolean AS xmax_rolled_back,
   (t_infomask & 4096)::boolean AS xmax_multixact
FROM heap_page_item_attrs(
get_raw_page('pg_proc', 75), 
'pg_proc'
 );

 lp |  ctid   |   xmin| xmax | xmax_is_lock | xmax_committed | 
xmax_rolled_back | xmax_multixact 
+-+---+--+--++--+
  1 | (75,1)  |   564 |0 | f| f  | t
| f  
  2 | (75,2)  |   564 |0 | f| f  | t
| f  
  3 | (75,3)  |   564 |0 | f| f  | t
| f  
  4 | (75,4)  |   564 |0 | f| f  | t
| f  
  5 | (75,5)  |   564 |0 | f| f  | t
| f  
  6 | (75,6)  |   564 |0 | f| f  | t
| f  
  7 | (75,7)  |   564 |0 | f| f  | t
| f  
  8 | (75,8)  |   564 |0 | f| f  | t
| f  
  9 | (75,9)  |   564 |0 | f| f  | t
| f  
 10 | (75,10) |   566 |0 | f| f  | t
| f  
 11 | (75,11) |   566 |0 | f| f  | t
| f  
 12 | |   |  |  ||  
|
 13 | |   |  |  ||  
|
 14 | (75,14) | 212156121 |0 | f| f  | t
| f  
 15 | (75,15) | 212156121 |0 | f| f  | t
| f  
 16 | (75,16) | 212156121 |0 | f| f  | t
| f  
 17 | (75,17) | 212156121 |0 | f| f  | t
| f  
 18 | (75,18) | 212156121 |0 | f| f  | t
| f  
 19 | (75,21) |   571 |  572 | f| f  | t
| f  
 20 | (75,20) | 212156121 |0 | f| f  | t
| f  
 21 | (75,21) |   572 |0 | f| f  | t
| f  
 22 | (75,22) |  97113057 |0 | f| f  | t
| f  
 23 | (75,23) | 212156121 |0 | f| f  | t
| f  
 24 | (75,24) | 212156121 |0 | f| f  | t
| f  
 25 | (75,25) | 212156121 |0 | f| f  | t
| f  
 26 | (75,26) | 212156121 |0 | f| f  | t
| f  
 27 | (75,27) | 212156121 |0 | f| f  | t
| f  
(27 rows)


SELECT lp, 
   t_ctid AS ctid,
   t_xmin AS xmin,
   t_xmax AS xmax,
   (t_i

Re: ERROR: cannot freeze committed xmax

2021-07-16 Thread Sasha Aliashkevich
Hi Alvaro, glad to hear from you!

This database is relatevely young, it was initdb'ed about a year ago or so and 
it was initially at 10.x. I don't know the exact minor version but the major 
version was 10 for sure. 

The problematic row is actually visible:

SELECT COUNT(*) FROM pg_proc WHERE ctid = '(75,19)'

 count 
---
 1
(1 row) 

And it refers to a known custom stored function.

Also I'm not sure if I can really skip freezing at this point because this 
table is already over autovacuum_freeze_max_age (correct me if I'm wrong):

SELECT age(c.relfrozenxid), s.setting autovacuum_freeze_max_age 
FROM pg_class c, pg_settings s 
WHERE c.relname = 'pg_proc' AND s.name = 'autovacuum_freeze_max_age';

age| autovacuum_freeze_max_age 
---+---
 213791108 | 2
(1 row)

This is a production database which I can not restart at any point 
unfortunately, so I can't change the global settings atm.

We can sacrifice that function and recreate it, but unfortunately when I 
execute DROP it fails with the error:

ERROR:  could not find tuple for rule 16396

I guess it's because of the broken pages at pg_depend.
Do you think it is safe to execute DELETE on the corresponding records in 
pr_proc and pro_depend? Would it update the broken xmax?

I feel like I'm dealing hear with some sort of data curruption.
I recall that few months ago we already had issues with the same function and 
view based on it. At that point for some reason it just stopped working. So we 
had to DROP and RECREATE it.
It fixed the issue, but apparently the data inconsistency stayed on disk and 
now we've got to the situation where AUTOVACUUM stopped working and it makes me 
really nervous.



> On 15 Jul 2021, at 15:41, Alvaro Herrera  wrote:
> 
> One thing I forgot is that these XIDs are fairly old, perhaps dating
> back to when this database was freshly initdb'd if there has been no XID
> wraparound.  In that case you were probably running a version much older
> than 10.14 when they were written.  Do you happen to know when did you
> initdb this, with what version, when did you upgrade this to 10.14?
> That may help search the commit log for bugfixes that might explain the
> bug.  I just remembered this one as my favorite candidate:
> 
> Author: Alvaro Herrera 
> Branch: master Release: REL_11_BR [d2599ecfc] 2018-05-04 18:24:45 -0300
> Branch: REL_10_STABLE Release: REL_10_4 [e1d634758] 2018-05-04 18:23:58 -0300
> Branch: REL9_6_STABLE Release: REL9_6_9 [3a11485a5] 2018-05-04 18:23:30 -0300
> 
>Don't mark pages all-visible spuriously
> 
>Dan Wood diagnosed a long-standing problem that pages containing tuples
>that are locked by multixacts containing live lockers may spuriously end
>up as candidates for getting their all-visible flag set.  This has the
>long-term effect that multixacts remain unfrozen; this may previously
>pass undetected, but since commit XYZ it would be reported as
>  "ERROR: found multixact 134100944 from before relminmxid 192042633"
>because when a later vacuum tries to freeze the page it detects that a
>multixact that should have gotten frozen, wasn't.
> 
>Dan proposed a (correct) patch that simply sets a variable to its
>correct value, after a bogus initialization.  But, per discussion, it
>seems better coding to avoid the bogus initializations altogether, since
>they could give rise to more bugs later.  Therefore this fix rewrites
>the logic a little bit to avoid depending on the bogus initializations.
> 
>This bug was part of a family introduced in 9.6 by commit a892234f830e;
>later, commit 38e9f90a227d fixed most of them, but this one was
>unnoticed.
> 
>Authors: Dan Wood, Pavan Deolasee, Álvaro Herrera
>Reviewed-by: Masahiko Sawada, Pavan Deolasee, Álvaro Herrera
>Discussion: 
> https://postgr.es/m/84ebac55-f06d-4fbe-a3f3-8bda093ce...@amazon.com
> 
> 
> -- 
> Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
> "El número de instalaciones de UNIX se ha elevado a 10,
> y se espera que este número aumente" (UPM, 1972)