Anybody?

On Jul 3, 2013, at 3:23 PM, Ben Chobot wrote:

> We have an async streaming setup using 9.1.9 and 3 nodes - let's call them A, 
> B, and C. A is the master, B and C are slaves. Today, A crashed, so we made B 
> be the master and told C to follow along with the switch by changing the 
> primary_conninfo in it's recovery.conf, making sure the history file had made 
> it to the WAL archive, then restarting it. That's worked very well for us in 
> the past, but not so much today. When C came back online, it started 
> complaining about missing WALs:
> 
> 2013-07-03T21:22:42.441347+00:00 pgdb41-vpc postgres[25779]: [18184-1] 
> db=,user= LOG:  shutting down
> 2013-07-03T21:22:42.457728+00:00 pgdb41-vpc postgres[25779]: [18185-1] 
> db=,user= LOG:  database system is shut down
> 2013-07-03T21:22:46.852845+00:00 pgdb41-vpc postgres[28942]: [1-1] db=,user= 
> LOG:  database system was shut down in recovery at 2013-07-03 21:22:42 UTC
> 2013-07-03T21:22:46.866127+00:00 pgdb41-vpc postgres[28947]: [1-1] 
> db=[unknown],user=[unknown] LOG:  incomplete startup packet
> 2013-07-03T21:22:47.368871+00:00 pgdb41-vpc postgres[28942]: [2-1] db=,user= 
> LOG:  restored log file "00000010.history" from archive
> 2013-07-03T21:22:47.413588+00:00 pgdb41-vpc postgres[28956]: [1-1] 
> db=postgres,user=postgres FATAL:  the database system is starting up
> 2013-07-03T21:22:47.767182+00:00 pgdb41-vpc postgres[28942]: [3-1] db=,user= 
> LOG:  restored log file "00000010.history" from archive
> 2013-07-03T21:22:47.767289+00:00 pgdb41-vpc postgres[28942]: [4-1] db=,user= 
> LOG:  entering standby mode
> 2013-07-03T21:22:47.930394+00:00 pgdb41-vpc postgres[28978]: [1-1] 
> db=postgres,user=postgres FATAL:  the database system is starting up
> 2013-07-03T21:22:48.410056+00:00 pgdb41-vpc postgres[28942]: [5-1] db=,user= 
> LOG:  redo starts at 1469/A2604868
> 2013-07-03T21:22:48.445921+00:00 pgdb41-vpc postgres[28986]: [1-1] 
> db=postgres,user=postgres FATAL:  the database system is starting up
> 2013-07-03T21:22:48.962090+00:00 pgdb41-vpc postgres[28994]: [1-1] 
> db=postgres,user=postgres FATAL:  the database system is starting up
> 2013-07-03T21:22:49.477279+00:00 pgdb41-vpc postgres[29020]: [1-1] 
> db=postgres,user=postgres FATAL:  the database system is starting up
> 2013-07-03T21:22:49.993021+00:00 pgdb41-vpc postgres[29027]: [1-1] 
> db=postgres,user=postgres FATAL:  the database system is starting up
> 2013-07-03T21:22:50.508848+00:00 pgdb41-vpc postgres[29034]: [1-1] 
> db=postgres,user=postgres FATAL:  the database system is starting up
> 2013-07-03T21:23:30.651775+00:00 pgdb41-vpc postgres[28942]: [6-1] db=,user= 
> LOG:  consistent recovery state reached at 146A/14FFFA8
> 2013-07-03T21:23:30.651805+00:00 pgdb41-vpc postgres[28942]: [7-1] db=,user= 
> LOG:  invalid magic number 0000 in log file 5226, segment 1, offset 5242880
> 2013-07-03T21:23:30.653214+00:00 pgdb41-vpc postgres[28917]: [1-1] db=,user= 
> LOG:  database system is ready to accept read only connections
> 2013-07-03T21:23:31.123588+00:00 pgdb41-vpc postgres[29754]: [2-1] db=,user= 
> LOG:  streaming replication successfully connected to primary
> 2013-07-03T21:23:31.123647+00:00 pgdb41-vpc postgres[29754]: [3-1] db=,user= 
> FATAL:  could not receive data from WAL stream: FATAL:  requested WAL segment 
> 000000100000146A00000001 has already been removed
> 
> At this point, my understanding of postgres must be wrong, because it appears 
> to me that the slave is looking for WAL 146A/01 because that's where it 
> reached consistent state. However, that was in the previous timeline - we 
> didn't get to the 10 history timeline till 146A/0C:
> 
> # cat 00000010.history 
> 15    0000000F0000146A0000000C        no recovery target specified
> 
> 
> Shouldn't postgres know to be looking for "0000000F0000146A00000001", not 
> "000000100000146A00000001"? I'm trying to see what part of our process we 
> have wrong to have ended up in this state but I'm missing it.
> 
> 
> For what it's worth the new master (node B) certainly seems to have all the 
> WAL files you might expect. Here's some snippets of an ls -l, but all the 
> files are there in between the snippets.
> 
> -rw------- 1 postgres postgres 16777216 Jul  3 21:13 0000000F000014690000009F
> -rw------- 1 postgres postgres 16777216 Jul  3 21:13 0000000F00001469000000A0
> -rw------- 1 postgres postgres 16777216 Jul  3 21:13 0000000F00001469000000A1
> -rw------- 1 postgres postgres 16777216 Jul  3 21:13 0000000F00001469000000A2
> -rw------- 1 postgres postgres 16777216 Jul  3 21:13 0000000F00001469000000A3
> -rw------- 1 postgres postgres 16777216 Jul  3 21:13 0000000F00001469000000A4
> -rw------- 1 postgres postgres 16777216 Jul  3 21:13 0000000F00001469000000A5
> -rw------- 1 postgres postgres 16777216 Jul  3 21:13 0000000F00001469000000A6
> .
> .
> .
> -rw------- 1 postgres postgres 16777216 Jul  3 21:15 0000000F0000146A00000000
> -rw------- 1 postgres postgres 16777216 Jul  3 21:15 0000000F0000146A00000001 
> <- the consistent state seems to be found here
> -rw------- 1 postgres postgres 16777216 Jul  3 21:15 0000000F0000146A00000002
> -rw------- 1 postgres postgres 16777216 Jul  3 21:15 0000000F0000146A00000003
> -rw------- 1 postgres postgres 16777216 Jul  3 21:15 0000000F0000146A00000004
> -rw------- 1 postgres postgres 16777216 Jul  3 21:15 0000000F0000146A00000005
> -rw------- 1 postgres postgres 16777216 Jul  3 21:15 0000000F0000146A00000006
> -rw------- 1 postgres postgres 16777216 Jul  3 21:15 0000000F0000146A00000007
> -rw------- 1 postgres postgres 16777216 Jul  3 21:15 0000000F0000146A00000008
> -rw------- 1 postgres postgres 16777216 Jul  3 21:15 0000000F0000146A00000009
> -rw------- 1 postgres postgres 16777216 Jul  3 21:15 0000000F0000146A0000000A
> -rw------- 1 postgres postgres 16777216 Jul  3 21:15 0000000F0000146A0000000B
> -rw------- 1 postgres postgres 16777216 Jul  3 21:15 0000000F0000146A0000000C
> -rw------- 1 postgres postgres 16777216 Jul  3 20:56 0000000F0000146A0000000D
> -rw------- 1 postgres postgres 16777216 Jul  3 20:57 0000000F0000146A0000000E
> -rw------- 1 postgres postgres 16777216 Jul  3 20:57 0000000F0000146A0000000F
> -rw------- 1 postgres postgres 16777216 Jul  3 20:57 0000000F0000146A00000010
> -rw------- 1 postgres postgres 16777216 Jul  3 20:57 0000000F0000146A00000011
> .
> .
> .
> -rw------- 1 postgres postgres 16777216 Jul  3 20:58 0000000F0000146A000000CF
> -rw------- 1 postgres postgres 16777216 Jul  3 20:58 0000000F0000146A000000D0
> -rw------- 1 postgres postgres 16777216 Jul  3 20:58 0000000F0000146A000000D1
> -rw------- 1 postgres postgres 16777216 Jul  3 20:58 0000000F0000146A000000D2
> -rw------- 1 postgres postgres 16777216 Jul  3 20:58 0000000F0000146A000000D3
> -rw------- 1 postgres postgres 16777216 Jul  3 20:58 0000000F0000146A000000D4
> -rw------- 1 postgres postgres 16777216 Jul  3 20:58 0000000F0000146A000000D5
> -rw------- 1 postgres postgres 16777216 Jul  3 21:23 000000100000146A0000000C 
> <- timeline switches here
> -rw------- 1 postgres postgres 16777216 Jul  3 21:25 000000100000146A0000000D
> -rw------- 1 postgres postgres 16777216 Jul  3 21:27 000000100000146A0000000E
> -rw------- 1 postgres postgres 16777216 Jul  3 21:28 000000100000146A0000000F
> -rw------- 1 postgres postgres 16777216 Jul  3 21:30 000000100000146A00000010
> -rw------- 1 postgres postgres 16777216 Jul  3 21:32 000000100000146A00000011
> -rw------- 1 postgres postgres 16777216 Jul  3 21:34 000000100000146A00000012
> 
> 
> 
> 
> 
> 

Reply via email to