I ran into a rather unusual problem today where  Postgres brought down a 
database to avoid transaction wraparound in a situation where it doesn't appear 
that it should have.

The error in the log is explicit enough...

Nov 16 04:00:03 SRP1 postgres[58101]: [1-1] FATAL:  database is not accepting 
commands to avoid wraparound data loss in database "stat"
Nov 16 04:00:03 SRP1 postgres[58101]: [1-2] HINT:  Stop the postmaster and use 
a standalone backend to vacuum database "stat".

Yet, going back several days in the logs, there were none of the usual WARNING 
messages in the log about this situation occurring in xxx transactions.

When I query datfrozenxid value in pg_database it certainly showed a problem.

    
backend> SELECT datname,datfrozenxid, age(datfrozenxid) FROM pg_database;
         1: datname     (typeid = 19, len = 64, typmod = -1, byval = f)
         2: datfrozenxid        (typeid = 28, len = 4, typmod = -1, byval = t)
         3: age (typeid = 23, len = 4, typmod = -1, byval = t)
        ----
         1: datname = "postgres"        (typeid = 19, len = 64, typmod = -1, 
byval = f)
         2: datfrozenxid = "2699851604" (typeid = 28, len = 4, typmod = -1, 
byval = t)
         3: age = "1269165380"  (typeid = 23, len = 4, typmod = -1, byval = t)
        ----
         1: datname = "stat" (typeid = 19, len = 64, typmod = -1, byval = f)
         2: datfrozenxid = "1822525199" (typeid = 28, len = 4, typmod = -1, 
byval = t)
***         3: age = "2146491785"  (typeid = 23, len = 4, typmod = -1, byval = 
t)
        ----
         1: datname = "config"       (typeid = 19, len = 64, typmod = -1, byval 
= f)
         2: datfrozenxid = "3869013990" (typeid = 28, len = 4, typmod = -1, 
byval = t)
         3: age = "100002994"   (typeid = 23, len = 4, typmod = -1, byval = t)
        ----
         1: datname = "template1"       (typeid = 19, len = 64, typmod = -1, 
byval = f)
         2: datfrozenxid = "2000352260" (typeid = 28, len = 4, typmod = -1, 
byval = t)
         3: age = "1968664724"  (typeid = 23, len = 4, typmod = -1, byval = t)
        ----
         1: datname = "template0"       (typeid = 19, len = 64, typmod = -1, 
byval = f)
         2: datfrozenxid = "2000357564" (typeid = 28, len = 4, typmod = -1, 
byval = t)
         3: age = "1968659420"  (typeid = 23, len = 4, typmod = -1, byval = t)
        ----


I then ran a query to check the age of all the relfrozenxid on each of the 
tables in the DB and it indicated that they were all OK.


backend> select sum(case when age(relfrozenxid) > 2000000000 then 1 else 0 end) 
as gt_2billion, sum(case when age(relfrozenxid) between 1500000000 and 
2000000000 then 1 else 0 end) as gt_1_5billion, sum(case when age(relfrozenxid) 
between 1000000000 and 1500000000 then 1 else 0 end) as gt_1billion, sum(case 
when age(relfrozenxid) between 500000000 and 1000000000 then 1 else 0 end) as 
gt_500million, sum(case when age(relfrozenxid) between 100000000 and 500000000 
then 1 else 0 end) as gt_100million, sum(case when age(relfrozenxid) < 
100000000 then 1 else 0 end) as lt_100million from pg_class where relkind in 
('r','t');
         1: gt_2billion (typeid = 20, len = 8, typmod = -1, byval = f)
         2: gt_1_5billion       (typeid = 20, len = 8, typmod = -1, byval = f)
         3: gt_1billion (typeid = 20, len = 8, typmod = -1, byval = f)
         4: gt_500million       (typeid = 20, len = 8, typmod = -1, byval = f)
         5: gt_100million       (typeid = 20, len = 8, typmod = -1, byval = f)
         6: lt_100million       (typeid = 20, len = 8, typmod = -1, byval = f)
        ----
         1: gt_2billion = "0"   (typeid = 20, len = 8, typmod = -1, byval = f)
         2: gt_1_5billion = "0" (typeid = 20, len = 8, typmod = -1, byval = f)
         3: gt_1billion = "0"   (typeid = 20, len = 8, typmod = -1, byval = f)
         4: gt_500million = "628"       (typeid = 20, len = 8, typmod = -1, 
byval = f)
         5: gt_100million = "8928"      (typeid = 20, len = 8, typmod = -1, 
byval = f)
         6: lt_100million = "0" (typeid = 20, len = 8, typmod = -1, byval = f)
        ----


I confirmed this with...

 backend>  select relname,relfrozenxid,age(relfrozenxid) from pg_class where 
relkind in('r','t') order by 3 desc limit 1;
          1: relname     (typeid = 19, len = 64, typmod = -1, byval = f)
          2: relfrozenxid        (typeid = 28, len = 4, typmod = -1, byval = t)
          3: age (typeid = 23, len = 4, typmod = -1, byval = t)
         ----
          1: relname = "qoe_flowbwidth_dist_dig1_014"    (typeid = 19, len = 
64, typmod = -1, byval = f)
          2: relfrozenxid = "2970264132" (typeid = 28, len = 4, typmod = -1, 
byval = t)
          3: age = "998752902"   (typeid = 23, len = 4, typmod = -1, byval = t)
         ----

My understanding has always been that the datfrozenxid should match this 
relfrozenxid, which it evidently doesn't. 

My environment:

FreeBSD 6
PG 8.2.4(Yes, I intend to upgrade, which would be helped if someone can say 
that this problem is fixed in some future patch/release release)

approx. 300GB database, with table partitioning, and lots of data inserts, very 
few updates. It has been running for a couple of years with relatively few 
problems. Yeh Postgres! 

autovacuum enabled

MY QUESTION: Can anyone offer up ideas on what could cause this situation in a 
seemingly stable system? I'm just trying to provide a "Root Cause Analysis" to 
management and I don't know quite how to explain the root cause :-)


Additional information...

I have warm standby set up on this database and failing over to the standby 
resulted in exactly the same problem(No real surprise).

I inadvertently, "fixed" the problem by running a vacuum command against a 
single existing table while in single-user mode. Voila, the datfrozenxid 
matched the oldest relfrozenxid and I could restart the DB. Since I had a 
standby database in the same state, I tried running a vacuum there against a 
sequence instead of a table. That also fixed the problem somehow even though it 
wasn't really a legitimate command which makes me think it may be some 
difference in autovacuum versus vacuum processing, but I am no C-programmer to 
debug that.

 autovacuum_freeze_max_age 
---------------------------
 1000000000
(1 row)


 vacuum_freeze_min_age 
-----------------------
 100000000
(1 row)


Thx...Mark





                                                                                
  

Reply via email to