Hi,

On Thursday, June 07, 2012 12:44:08 PM Valentine Gogichashvili wrote:
> I have the situation again, one of 3 slaves was slow to play all the WAL
> files and being about 10GB late it crashed with the same error again.
> 
> I collected DEBUG4 output in this time:
> https://docs.google.com/open?id=0B2NMMrfiBQcLZjNDbU0xQ3lvWms
Ok, I stared at this some time and I think I see what the problem is. Some log 
excerpts that lead my reasoning:

2012-06-06 15:35:51.954 "recovery snapshot waiting for non-overflowed snapshot 
or until oldest active xid on standby is at least 3730302193 (now 
3730301766)",,,,,"xlog redo  running xacts: nextXid 3730302194 
latestCompletedXid 3730302179 oldestRunningXid 3730301766; 68 xacts: ...  
subxid ovf",,,,""
So we found a overflowed snapshot after we already are in 
STANDBY_SNAPSHOT_PENDING. This implies we have seen an overflowed snapshot 
before.
In STANDBY_SNAPSHOT_PENDING we start recording known assigned xids to build 
our snapshot incrementally.

2012-06-06 15:35:51.954 record known xact 3730301766 latestObservedXid 
3730302197
2012-06-06 15:37:49.580 record known xact 3730316333 latestObservedXid 
3730316332 3730316517
Ok, we started filling the KnownAssignedXid machinery. That means procArray-
>numKnownAssignedXids != 0.


2012-06-06 15:37:54.401 FATAL,XX000,"out-of-order XID insertion in 
KnownAssignedXids",,,,,"xlog redo  running xacts: nextXid 3730316518 
latestCompletedXid 3730316517 oldestRunningXid 3730316333; 4 xacts: 3730316429 
3730316508 3730316333 3730316430",,,,""
the "xlog redo running xacts" bits tells us we have got a snapshot that is 
*not* overflowed.
That means in ProcArrayAppylRecoveryInfo in the following part:
        if (standbyState == STANDBY_SNAPSHOT_PENDING)
        {
                /*
                 * If the snapshot isn't overflowed or if its empty we can
                 * reset our pending state and use this snapshot instead.
                 */
                if (!running->subxid_overflow || running->xcnt == 0)
                {
                        standbyState = STANDBY_INITIALIZED;
                }
                else
                {
...
                }
        }
we reset our state back to STANDBY_INITIALIZED. And fall back to computing our 
state at once in one piece instead of incrementally as we started doing for     
SNAPSHOT_PENDING.
The problem is is know that that code assumes we run without any previous 
recorded xids. This is even formalized in an assert which we don't hit because 
were running without Assert:
...
        Assert(standbyState == STANDBY_INITIALIZED);
...
        Assert(procArray->numKnownAssignedXids == 0);
...
after that we start adding all currently running xids from the snapshot to the 
KnownAssigned machinery. They are already recorded though, so we fail in 
KnownAssignedXidsAdd with the OPs error.

The simplest fix for that seems to be to simply reset the KnownAssignedXids 
state in the above branch. Any arguments against that?

This seems to have been broken in commit 
10b7c686e52a6d1bb10194ebf9331ef06f044d46

Andres

-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Reply via email to