Re: [HACKERS] Fast promotion failure

2013-05-21 Thread Simon Riggs
On 21 May 2013 09:26, Simon Riggs wrote: > I'm OK with that principle... Well, after fighting some more with that, I've gone with the, er, principle of slightly less ugliness.

Re: [HACKERS] Fast promotion failure

2013-05-21 Thread Simon Riggs
On 21 May 2013 07:46, Heikki Linnakangas wrote: > On 21.05.2013 00:00, Simon Riggs wrote: >> >> When we set the new timeline we should be updating all values that >> might be used elsewhere. If we do that, then no matter when or how we >> run GetXLogReplayRecPtr, it can't ever get it wrong in any

Re: [HACKERS] Fast promotion failure

2013-05-20 Thread Heikki Linnakangas
On 21.05.2013 00:00, Simon Riggs wrote: When we set the new timeline we should be updating all values that might be used elsewhere. If we do that, then no matter when or how we run GetXLogReplayRecPtr, it can't ever get it wrong in any backend. --- a/src/backend/access/transam/xlog.c +++ b/src/b

Re: [HACKERS] Fast promotion failure

2013-05-20 Thread Simon Riggs
On 20 May 2013 20:40, Heikki Linnakangas wrote: > On 20.05.2013 22:18, Simon Riggs wrote: >> >> On 20 May 2013 18:47, Heikki Linnakangas wrote: >>> >>> Not sure what the best fix would be. Perhaps change the code in >>> >>> CreateRestartPoint() to do something like this instead: >>> >>> GetXLogRe

Re: [HACKERS] Fast promotion failure

2013-05-20 Thread Heikki Linnakangas
On 20.05.2013 22:18, Simon Riggs wrote: On 20 May 2013 18:47, Heikki Linnakangas wrote: Not sure what the best fix would be. Perhaps change the code in CreateRestartPoint() to do something like this instead: GetXLogReplayRecPtr(&replayTLI); if (RecoveryInProgress()) ThisTimeLineID = replayT
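The quoted suggestion is cut off, so the following is only a toy illustration of the part that is visible: read the replay timeline first, then adopt it as the local timeline only while recovery is still in progress. Every name here (ToyShared, toy_restartpoint_tli) is hypothetical, and this is not the code from xlog.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: hypothetical types, not PostgreSQL code. */
typedef uint32_t ToyTimeLineID;

typedef struct ToyShared
{
    bool          recovery_in_progress; /* true until promotion completes */
    ToyTimeLineID replay_tli;           /* timeline of the last replayed record */
} ToyShared;

/*
 * Return the timeline the caller should use from now on.
 *
 * The replay timeline is read first. It replaces the caller's local value
 * only if recovery is still in progress at the time of the check; once
 * recovery has ended, the local value is left untouched.
 */
ToyTimeLineID
toy_restartpoint_tli(const ToyShared *shared, ToyTimeLineID local_tli)
{
    ToyTimeLineID replay_tli;

    assert(shared != NULL);

    replay_tli = shared->replay_tli;
    if (shared->recovery_in_progress)
        return replay_tli;
    return local_tli;
}
```

In this toy, the ordering means a process that observes the end of recovery never overwrites its local timeline with a replay value read earlier.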

Re: [HACKERS] Fast promotion failure

2013-05-20 Thread Simon Riggs
On 20 May 2013 18:47, Heikki Linnakangas wrote: > On 19.05.2013 17:25, Simon Riggs wrote: >> So while I believe that the checkpointer might have an incorrect TLI >> and that you've seen a bug, what isn't clear is that the checkpointer >> is the only process that would see an incorrect TLI, or why

Re: [HACKERS] Fast promotion failure

2013-05-20 Thread Heikki Linnakangas
On 19.05.2013 17:25, Simon Riggs wrote: However, there is a call to RecoveryInProgress() at the top of the main loop of the checkpointer, which does explicitly state that it "initializes TimeLineID if it's not set yet." The checkpointer makes the decision about whether to run a restartpoint or a

Re: [HACKERS] Fast promotion failure

2013-05-19 Thread Simon Riggs
On 7 May 2013 10:57, Heikki Linnakangas wrote: > While testing the bug from the "Assertion failure at standby promotion", I > bumped into a different bug in fast promotion. When the first checkpoint > after fast promotion is performed, there is no guarantee that the > checkpointer process is runni

Re: [HACKERS] Fast promotion failure

2013-05-16 Thread Amit Kapila
On Thursday, May 16, 2013 11:33 AM Kyotaro HORIGUCHI wrote: > Hello, > > > > >> Is the point of this discussion that the patch may leave out > some > > > >> glitch about timing of timeline-related changing and Heikki saw > an > > > >> egress of that? > > > > > > > > AFAIU, the committed patch has s

Re: [HACKERS] Fast promotion failure

2013-05-16 Thread Simon Riggs
On 16 May 2013 07:02, Kyotaro HORIGUCHI wrote: >> > > fast promotion issue. Excuse me for not joining the thread earlier. I'm not available today, but will join in later in my evening.

Re: [HACKERS] Fast promotion failure

2013-05-15 Thread Kyotaro HORIGUCHI
Hello, > > >> Is the point of this discussion that the patch may leave out some > > >> glitch about timing of timeline-related changing and Heikki saw an > > >> egress of that? > > > > > > AFAIU, the committed patch has some gap in overall scenario which is > > the > > > fast promotion issue. > >

Re: [HACKERS] Fast promotion failure

2013-05-13 Thread Amit Kapila
On Monday, May 13, 2013 1:13 PM Heikki Linnakangas wrote: > On 13.05.2013 06:07, Amit Kapila wrote: > > On Monday, May 13, 2013 5:54 AM Kyotaro HORIGUCHI wrote: > >> Heikki said in the first message in this thread that he suspected > >> the cause of the failure he had seen to be wrong TLI on which

Re: [HACKERS] Fast promotion failure

2013-05-13 Thread Heikki Linnakangas
On 13.05.2013 06:07, Amit Kapila wrote: On Monday, May 13, 2013 5:54 AM Kyotaro HORIGUCHI wrote: Heikki said in the first message in this thread that he suspected the cause of the failure he had seen to be wrong TLI on which checkpointer runs. Nevertheless, the patch you suggested for me looks f

Re: [HACKERS] Fast promotion failure

2013-05-12 Thread Amit Kapila
On Monday, May 13, 2013 5:54 AM Kyotaro HORIGUCHI wrote: > 2013/05/10 20:01 "Amit Kapila" : > > > > C 2013-05-10 15:32:32.170 JST 9242 FATAL: could not receive data > > > from WAL stream: > > > > Is there any chance, that there is any network glitch caused this one > time > > error. > > Unix doma

Re: [HACKERS] Fast promotion failure

2013-05-12 Thread Kyotaro HORIGUCHI
2013/05/10 20:01 "Amit Kapila" : > > > C 2013-05-10 15:32:32.170 JST 9242 FATAL: could not receive data > > from WAL stream: > > Is there any chance, that there is any network glitch caused this one time > error. Unix domain sockets are hardly likely to have such troubles. This test ran within sin

Re: [HACKERS] Fast promotion failure

2013-05-10 Thread Amit Kapila
On Friday, May 10, 2013 2:07 PM Kyotaro HORIGUCHI wrote: > Thank you for notifying me of that. > > > It seems to me, it is the same problem as discussed and fixed in > below > > thread. > > http://www.postgresql.org/message-id/51894942.4080...@vmware.com > > > > Could you try with fixes given by he

Re: [HACKERS] Fast promotion failure

2013-05-10 Thread Kyotaro HORIGUCHI
Thank you for notifying me of that. > It seems to me, it is the same problem as discussed and fixed in below > thread. > http://www.postgresql.org/message-id/51894942.4080...@vmware.com > > Could you try with fixes given by Heikki. The first one settles the timeline transition problem for the pr

Re: [HACKERS] Fast promotion failure

2013-05-09 Thread Amit Kapila
On Thursday, May 09, 2013 2:14 PM Kyotaro HORIGUCHI wrote: > With printing some additional logs, the situation should be more > clear. > > It seems that Sby-B fails to promote to TLI=2; nevertheless the > history file for TLI=2 is somehow sent to sby-C. So sby-B > remains on TLI=1 but sby-C s

Re: [HACKERS] Fast promotion failure

2013-05-09 Thread Kyotaro HORIGUCHI
With printing some additional logs, the situation should be more clear. It seems that Sby-B fails to promote to TLI=2; nevertheless the history file for TLI=2 is somehow sent to sby-C. So sby-B remains on TLI=1 but sby-C solely switches onto TLI=2. # Come to think of this, I suspect that the

Re: [HACKERS] Fast promotion failure

2013-05-09 Thread Kyotaro HORIGUCHI
Hello, > I think it can so happen that last checkpoint is with old timeline and there > are operations with new timeline which might have caused the problem Heikki > has seen. I suppose to have seen that. After adding an SQL command to modify the DB on standby-B after passive(propagated?) promot

Re: [HACKERS] Fast promotion failure

2013-05-08 Thread Amit Kapila
On Thursday, May 09, 2013 6:29 AM Fujii Masao wrote: > On Tue, May 7, 2013 at 6:57 PM, Heikki Linnakangas > wrote: > > While testing the bug from the "Assertion failure at standby > promotion", I > > bumped into a different bug in fast promotion. When the first > checkpoint > > after fast promotio

Re: [HACKERS] Fast promotion failure

2013-05-08 Thread Fujii Masao
On Tue, May 7, 2013 at 6:57 PM, Heikki Linnakangas wrote: > While testing the bug from the "Assertion failure at standby promotion", I > bumped into a different bug in fast promotion. When the first checkpoint > after fast promotion is performed, there is no guarantee that the > checkpointer proce

[HACKERS] Fast promotion failure

2013-05-07 Thread Heikki Linnakangas
While testing the bug from the "Assertion failure at standby promotion", I bumped into a different bug in fast promotion. When the first checkpoint after fast promotion is performed, there is no guarantee that the checkpointer process is running with the correct, new, ThisTimeLineID. In CreateC
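The kind of problem described here can be illustrated with a small toy model, a sketch under stated assumptions rather than the actual xlog.c logic. It assumes a process-local timeline copy that is initialized only when unset (the behavior attributed to RecoveryInProgress() in a reply in this thread), and shows how such a copy can stay on the old timeline after promotion. All Toy* names and toy_* functions are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model only: hypothetical types and functions, not PostgreSQL code. */
typedef uint32_t ToyTimeLineID;

typedef struct ToySharedState
{
    bool          recovery_in_progress; /* true while still a standby */
    ToyTimeLineID current_tli;          /* authoritative timeline */
} ToySharedState;

typedef struct ToyProcess
{
    ToyTimeLineID local_tli;            /* process-local copy; 0 means "not set" */
} ToyProcess;

/* Assumed rule: the local copy is filled in only if it has never been set. */
void
toy_init_local_tli(ToyProcess *proc, const ToySharedState *shared)
{
    if (proc->local_tli == 0)
        proc->local_tli = shared->current_tli;
}

/* Promotion in the toy: switch to the next timeline and end recovery. */
void
toy_promote(ToySharedState *shared)
{
    assert(shared->recovery_in_progress);
    shared->current_tli += 1;
    shared->recovery_in_progress = false;
}

/* Timeline a checkpoint from this process would use: its local copy. */
ToyTimeLineID
toy_checkpoint_tli(const ToyProcess *proc)
{
    return proc->local_tli;
}
```

If a process sets its local copy before toy_promote() runs, a later toy_init_local_tli() call changes nothing, so toy_checkpoint_tli() still reports the old timeline while the shared state has moved on.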