Anybody? On Jul 3, 2013, at 3:23 PM, Ben Chobot wrote:
> We have an async streaming setup using 9.1.9 and 3 nodes - let's call them A, > B, and C. A is the master, B and C are slaves. Today, A crashed, so we made B > be the master and told C to follow along with the switch by changing the > primary_conninfo in it's recovery.conf, making sure the history file had made > it to the WAL archive, then restarting it. That's worked very well for us in > the past, but not so much today. When C came back online, it started > complaining about missing WALs: > > 2013-07-03T21:22:42.441347+00:00 pgdb41-vpc postgres[25779]: [18184-1] > db=,user= LOG: shutting down > 2013-07-03T21:22:42.457728+00:00 pgdb41-vpc postgres[25779]: [18185-1] > db=,user= LOG: database system is shut down > 2013-07-03T21:22:46.852845+00:00 pgdb41-vpc postgres[28942]: [1-1] db=,user= > LOG: database system was shut down in recovery at 2013-07-03 21:22:42 UTC > 2013-07-03T21:22:46.866127+00:00 pgdb41-vpc postgres[28947]: [1-1] > db=[unknown],user=[unknown] LOG: incomplete startup packet > 2013-07-03T21:22:47.368871+00:00 pgdb41-vpc postgres[28942]: [2-1] db=,user= > LOG: restored log file "00000010.history" from archive > 2013-07-03T21:22:47.413588+00:00 pgdb41-vpc postgres[28956]: [1-1] > db=postgres,user=postgres FATAL: the database system is starting up > 2013-07-03T21:22:47.767182+00:00 pgdb41-vpc postgres[28942]: [3-1] db=,user= > LOG: restored log file "00000010.history" from archive > 2013-07-03T21:22:47.767289+00:00 pgdb41-vpc postgres[28942]: [4-1] db=,user= > LOG: entering standby mode > 2013-07-03T21:22:47.930394+00:00 pgdb41-vpc postgres[28978]: [1-1] > db=postgres,user=postgres FATAL: the database system is starting up > 2013-07-03T21:22:48.410056+00:00 pgdb41-vpc postgres[28942]: [5-1] db=,user= > LOG: redo starts at 1469/A2604868 > 2013-07-03T21:22:48.445921+00:00 pgdb41-vpc postgres[28986]: [1-1] > db=postgres,user=postgres FATAL: the database system is starting up > 2013-07-03T21:22:48.962090+00:00 pgdb41-vpc postgres[28994]: [1-1] > db=postgres,user=postgres FATAL: the database system is starting up > 2013-07-03T21:22:49.477279+00:00 pgdb41-vpc postgres[29020]: [1-1] > db=postgres,user=postgres FATAL: the database system is starting up > 2013-07-03T21:22:49.993021+00:00 pgdb41-vpc postgres[29027]: [1-1] > db=postgres,user=postgres FATAL: the database system is starting up > 2013-07-03T21:22:50.508848+00:00 pgdb41-vpc postgres[29034]: [1-1] > db=postgres,user=postgres FATAL: the database system is starting up > 2013-07-03T21:23:30.651775+00:00 pgdb41-vpc postgres[28942]: [6-1] db=,user= > LOG: consistent recovery state reached at 146A/14FFFA8 > 2013-07-03T21:23:30.651805+00:00 pgdb41-vpc postgres[28942]: [7-1] db=,user= > LOG: invalid magic number 0000 in log file 5226, segment 1, offset 5242880 > 2013-07-03T21:23:30.653214+00:00 pgdb41-vpc postgres[28917]: [1-1] db=,user= > LOG: database system is ready to accept read only connections > 2013-07-03T21:23:31.123588+00:00 pgdb41-vpc postgres[29754]: [2-1] db=,user= > LOG: streaming replication successfully connected to primary > 2013-07-03T21:23:31.123647+00:00 pgdb41-vpc postgres[29754]: [3-1] db=,user= > FATAL: could not receive data from WAL stream: FATAL: requested WAL segment > 000000100000146A00000001 has already been removed > > At this point, my understanding of postgres must be wrong, because it appears > to me that the slave is looking for WAL 146A/01 because that's where it > reached consistent state. However, that was in the previous timeline - we > didn't get to the 10 history timeline till 146A/0C: > > # cat 00000010.history > 15 0000000F0000146A0000000C no recovery target specified > > > Shouldn't postgres know to be looking for "0000000F0000146A00000001", not > "000000100000146A00000001"? I'm trying to see what part of our process we > have wrong to have ended up in this state but I'm missing it. > > > For what it's worth the new master (node B) certainly seems to have all the > WAL files you might expect. Here's some snippets of an ls -l, but all the > files are there in between the snippets. > > -rw------- 1 postgres postgres 16777216 Jul 3 21:13 0000000F000014690000009F > -rw------- 1 postgres postgres 16777216 Jul 3 21:13 0000000F00001469000000A0 > -rw------- 1 postgres postgres 16777216 Jul 3 21:13 0000000F00001469000000A1 > -rw------- 1 postgres postgres 16777216 Jul 3 21:13 0000000F00001469000000A2 > -rw------- 1 postgres postgres 16777216 Jul 3 21:13 0000000F00001469000000A3 > -rw------- 1 postgres postgres 16777216 Jul 3 21:13 0000000F00001469000000A4 > -rw------- 1 postgres postgres 16777216 Jul 3 21:13 0000000F00001469000000A5 > -rw------- 1 postgres postgres 16777216 Jul 3 21:13 0000000F00001469000000A6 > . > . > . > -rw------- 1 postgres postgres 16777216 Jul 3 21:15 0000000F0000146A00000000 > -rw------- 1 postgres postgres 16777216 Jul 3 21:15 0000000F0000146A00000001 > <- the consistent state seems to be found here > -rw------- 1 postgres postgres 16777216 Jul 3 21:15 0000000F0000146A00000002 > -rw------- 1 postgres postgres 16777216 Jul 3 21:15 0000000F0000146A00000003 > -rw------- 1 postgres postgres 16777216 Jul 3 21:15 0000000F0000146A00000004 > -rw------- 1 postgres postgres 16777216 Jul 3 21:15 0000000F0000146A00000005 > -rw------- 1 postgres postgres 16777216 Jul 3 21:15 0000000F0000146A00000006 > -rw------- 1 postgres postgres 16777216 Jul 3 21:15 0000000F0000146A00000007 > -rw------- 1 postgres postgres 16777216 Jul 3 21:15 0000000F0000146A00000008 > -rw------- 1 postgres postgres 16777216 Jul 3 21:15 0000000F0000146A00000009 > -rw------- 1 postgres postgres 16777216 Jul 3 21:15 0000000F0000146A0000000A > -rw------- 1 postgres postgres 16777216 Jul 3 21:15 0000000F0000146A0000000B > -rw------- 1 postgres postgres 16777216 Jul 3 21:15 0000000F0000146A0000000C > -rw------- 1 postgres postgres 16777216 Jul 3 20:56 0000000F0000146A0000000D > -rw------- 1 postgres postgres 16777216 Jul 3 20:57 0000000F0000146A0000000E > -rw------- 1 postgres postgres 16777216 Jul 3 20:57 0000000F0000146A0000000F > -rw------- 1 postgres postgres 16777216 Jul 3 20:57 0000000F0000146A00000010 > -rw------- 1 postgres postgres 16777216 Jul 3 20:57 0000000F0000146A00000011 > . > . > . > -rw------- 1 postgres postgres 16777216 Jul 3 20:58 0000000F0000146A000000CF > -rw------- 1 postgres postgres 16777216 Jul 3 20:58 0000000F0000146A000000D0 > -rw------- 1 postgres postgres 16777216 Jul 3 20:58 0000000F0000146A000000D1 > -rw------- 1 postgres postgres 16777216 Jul 3 20:58 0000000F0000146A000000D2 > -rw------- 1 postgres postgres 16777216 Jul 3 20:58 0000000F0000146A000000D3 > -rw------- 1 postgres postgres 16777216 Jul 3 20:58 0000000F0000146A000000D4 > -rw------- 1 postgres postgres 16777216 Jul 3 20:58 0000000F0000146A000000D5 > -rw------- 1 postgres postgres 16777216 Jul 3 21:23 000000100000146A0000000C > <- timeline switches here > -rw------- 1 postgres postgres 16777216 Jul 3 21:25 000000100000146A0000000D > -rw------- 1 postgres postgres 16777216 Jul 3 21:27 000000100000146A0000000E > -rw------- 1 postgres postgres 16777216 Jul 3 21:28 000000100000146A0000000F > -rw------- 1 postgres postgres 16777216 Jul 3 21:30 000000100000146A00000010 > -rw------- 1 postgres postgres 16777216 Jul 3 21:32 000000100000146A00000011 > -rw------- 1 postgres postgres 16777216 Jul 3 21:34 000000100000146A00000012 > > > > > >