The following bug has been logged on the website: Bug reference: 6661 Logged by: Valentine Gogichashvili Email address: val...@gmail.com PostgreSQL version: 9.0.7 Operating system: Linux version 2.6.32-5-amd64 (Debian 2.6.32-41sque Description:
Hello, today when trying to restore a replication database I got the following error: ... 2012-05-22 21:20:24.823 CEST,,,23804,,4fbbe69e.5cfc,124,,2012-05-22 21:18:54 CEST,1/0,0,LOG,00000,"restored log file ""0000000200001C9C0000001E"" from archive",,,,,,,,,"" 2012-05-22 21:20:24.954 CEST,,,23804,,4fbbe69e.5cfc,125,,2012-05-22 21:18:54 CEST,1/0,0,LOG,00000,"restored log file ""0000000200001C9C0000001F"" from archive",,,,,,,,,"" 2012-05-22 21:20:26.374 CEST,,,23804,,4fbbe69e.5cfc,126,,2012-05-22 21:18:54 CEST,1/0,0,LOG,00000,"restored log file ""0000000200001C9C00000020"" from archive",,,,,,,,,"" 2012-05-22 21:20:27.238 CEST,,,23804,,4fbbe69e.5cfc,127,,2012-05-22 21:18:54 CEST,1/0,0,LOG,00000,"restored log file ""0000000200001C9C00000021"" from archive",,,,,,,,,"" 2012-05-22 21:20:27.443 CEST,,,23804,,4fbbe69e.5cfc,128,,2012-05-22 21:18:54 CEST,1/0,0,LOG,00000,"consistent recovery state reached at 1C9C/21318C20",,,,,,,,,"" 2012-05-22 21:20:27.868 CEST,,,23804,,4fbbe69e.5cfc,129,,2012-05-22 21:18:54 CEST,1/0,0,LOG,00000,"4 KnownAssignedXids (num=4 tail=0 head=916) [0]=3674726497 [1]=3674727041 [2]=3674727042 [3]=3674727128 ",,,,,"xlog redo running xacts: nextXid 3674728633 latestCompletedXid 3674728632 oldestRunningXid 3674726497; 4 xacts: 3674727041 3674727042 3674727128 3674726497",,,,"" 2012-05-22 21:20:27.868 CEST,,,23804,,4fbbe69e.5cfc,130,,2012-05-22 21:18:54 CEST,1/0,0,FATAL,XX000,"out-of-order XID insertion in KnownAssignedXids",,,,,"xlog redo running xacts: nextXid 3674728633 latestCompletedXid 3674728632 oldestRunningXid 3674726497; 4 xacts: 3674727041 3674727042 3674727128 3674726497",,,,"" 2012-05-22 21:20:27.939 CEST,,,23793,,4fbbe69e.5cf1,1,,2012-05-22 21:18:54 CEST,,0,LOG,00000,"startup process (PID 23804) exited with exit code 1",,,,,,,,,"" 2012-05-22 21:20:27.939 CEST,,,23793,,4fbbe69e.5cf1,2,,2012-05-22 21:18:54 CEST,,0,LOG,00000,"terminating any other active server processes",,,,,,,,,"" This happened after replaying 124 WAL files from archive. Full log is here: https://gist.github.com/ff51ccf506f5e522e271 I tried to start the database 3 times, with the same result. Additionally I removed all existing WAL files from pg_xlog, and tried to start again, but got the same result this time as well (though probably I should have tried the other way round, as WAL files are anyway taken from the archive if there are there... anyway, just want to describe everything I did) The database on that machine is running as a read-only hot-standby machine for running long running statistical queries, that we do not want to run on the production system. It has significantly bigger shared_buffers = 48GB ( in comparison to 8GB on the master ). Sometimes there is really a lot happening on the master machine, and the replication machine does not seem to be able to catch up with the master. This usually happens, when the replay is paused because of a long running query, but after the conflicting query is gone and the replication delay is bigger then 2-3 Gigabytes, it seems not to be able catch up any longer, it looks like a snowball effect. At this time it has a 100% of one CPU on recovery process with practically no IOWait (so for me it seems like a CPU bottleneck rather then IO one, that is rather confusing). Here is the replication delay graph https://lh4.googleusercontent.com/-UBf1iBHZGlQ/T7v9L-bpLvI/AAAAAAAAMms/Ve-eAOFEXjw/s800/replication_delay.png In the past, if the replication delay was bigger then 20GB, I was stopping the server, restarting it so, that is replays the WAL files from the archive and subjectively it seemed, like the replay was faster, if shared_buffers was set to much lower value (4GB). So today, I did the same, but got an error "out-of-order XID insertion in KnownAssignedXids" and could not restore the machine. Now, while rebuilding hot-standby of that 250GB database I have some time to write this report. More information about the affected machine: Intel(R) Xeon(R) CPU E5540 @ 2.53GHz (16 Cores) cat /proc/version Linux version 2.6.32-5-amd64 (Debian 2.6.32-41squeeze2) (da...@debian.org) (gcc version 4.3.5 (Debian 4.3.5-4) ) #1 SMP Thu Mar 22 17:26:33 UTC 2012 free -m total used free shared buffers cached Mem: 129192 128858 333 0 31 124969 -/+ buffers/cache: 3857 125334 Swap: 1952 1 1951 df -h /dev/mapper/vg00-data 54G 1.3G 52G 3% /data /dev/mapper/vg01-data1 1.7T 373G 1.3T 23% /data1 -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs