Re: [BUG] Panic due to incorrect missingContrecPtr after promotion

2022-08-31 Thread Imseih (AWS), Sami
>> For the benefit of anyone who may be looking at this thread in the >> archive later, I believe this commit will have fixed this issue: I can also confirm that the commit https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=6672d79 does fix this issue. Thanks!

Re: [BUG] Panic due to incorrect missingContrecPtr after promotion

2022-08-29 Thread Robert Haas
On Mon, Aug 29, 2022 at 12:54 PM Robert Haas wrote: > On Wed, Aug 10, 2022 at 4:37 AM Kyotaro Horiguchi > wrote: > > So, it seems that the *standby* received the inconsistent WAL stream > > (aborted-contrecord not followed by an overwriting-missing-contrecord) > > from the primary. Thus the incon

Re: [BUG] Panic due to incorrect missingContrecPtr after promotion

2022-08-29 Thread Robert Haas
On Wed, Aug 10, 2022 at 4:37 AM Kyotaro Horiguchi wrote: > So, it seems that the *standby* received the inconsistent WAL stream > (aborted-contrecord not followed by an overwriting-missing-contrecord) > from the primary. Thus the inconsistency happened on the primary, not > on the standby. > > So.

Re: [BUG] Panic due to incorrect missingContrecPtr after promotion

2022-08-10 Thread Kyotaro Horiguchi
Hello. > Yes, that is correct. Mmm. I believed that the log came from a single server run, since the PID (I believe [359] and [357] are PIDs) did not change across the log lines. > 2022-08-05 18:50:13 UTC::@:[359]:LOG: creating missing WAL directory > "pg_wal/archive_status" This means that

Re: [BUG] Panic due to incorrect missingContrecPtr after promotion

2022-08-08 Thread Imseih (AWS), Sami
> The server seems to have started as a standby after crashing a > primary. Is that correct? Yes, that is correct. 2022-08-05 17:18:51 UTC::@:[359]:LOG: database system was interrupted; last known up at 2022-08-05 17:08:52 UTC 2022-08-05 17:18:51 UTC::@:[359]:LOG: creating missing WAL directory

Re: [BUG] Panic due to incorrect missingContrecPtr after promotion

2022-08-07 Thread Kyotaro Horiguchi
At Fri, 5 Aug 2022 21:28:16 +, "Imseih (AWS), Sami" wrote in > > Would you mind trying the second attached to obtain detailed log on > > your testing environment? With the patch, the modified TAP test yields > > the log lines like below. > > I applied the logging patch to 13.7 (attached is

Re: [BUG] Panic due to incorrect missingContrecPtr after promotion

2022-07-29 Thread Imseih (AWS), Sami
> Any luck with this? Apologies for the delay, as I have been away. I will test this next week and report back my findings. Thanks, Sami Imseih, Amazon Web Services (AWS)

Re: [BUG] Panic due to incorrect missingContrecPtr after promotion

2022-07-28 Thread alvhe...@alvh.no-ip.org
Hello, On 2022-Jun-29, Imseih (AWS), Sami wrote: > > Would you mind trying the second attached to obtain detailed log on > > your testing environment? With the patch, the modified TAP test yields > > the log lines like below. > > Thanks for this. I will apply this to the testing environment and

Re: [BUG] Panic due to incorrect missingContrecPtr after promotion

2022-06-29 Thread Imseih (AWS), Sami
> Would you mind trying the second attached to obtain detailed log on > your testing environment? With the patch, the modified TAP test yields > the log lines like below. Thanks for this. I will apply this to the testing environment and will share the output. Regards, Sami Imseih Amazon Web Serv

Re: [BUG] Panic due to incorrect missingContrecPtr after promotion

2022-06-28 Thread Kyotaro Horiguchi
I'd like to look into the WAL segments related to the failure. Mmm... With the patch, xlogreader->abortedRecPtr is valid if and only if the last record that failed to be read was an aborted contrec. If recovery ends here, the first inserted record is an "aborted contrec" record. I still see it as the

Re: [BUG] Panic due to incorrect missingContrecPtr after promotion

2022-06-27 Thread Kyotaro Horiguchi
At Mon, 27 Jun 2022 15:02:11 +0900, Michael Paquier wrote in > On Fri, Jun 24, 2022 at 04:17:34PM +, Imseih (AWS), Sami wrote: > > It has been difficult to get a generic repro, but the way we reproduce > > is through our test suite. To give more details, we are running tests > > in which we c

Re: [BUG] Panic due to incorrect missingContrecPtr after promotion

2022-06-26 Thread Michael Paquier
On Fri, Jun 24, 2022 at 04:17:34PM +, Imseih (AWS), Sami wrote: > It has been difficult to get a generic repro, but the way we reproduce > is through our test suite. To give more details, we are running tests > in which we constantly failover and promote standbys. The issue > surfaces after we h

Re: [BUG] Panic due to incorrect missingContrecPtr after promotion

2022-06-24 Thread Imseih (AWS), Sami
> Thus, I still don't see what has happened on Imseih's side, but I can > cause PANIC with somewhat tricky steps, which I don't think are valid. This > is why I wanted to know the exact steps to cause the PANIC. > The attached 1 is the PoC of the TAP test (it uses system()..), and > the

Re: [BUG] Panic due to incorrect missingContrecPtr after promotion

2022-06-22 Thread Robert Haas
On Mon, Jun 20, 2022 at 9:35 PM Kyotaro Horiguchi wrote: > Unfortunately it doesn't work because we read a record already known > to be complete again at the end of recovery. That is the reason for > "abortedRecPtr < xlogreader->EndRecPtr" in my PoC patch. Without it, > abortedRecPtr is erased when

Re: [BUG] Panic due to incorrect missingContrecPtr after promotion

2022-06-21 Thread Michael Paquier
On Tue, Jun 21, 2022 at 10:35:33AM +0900, Kyotaro Horiguchi wrote: > At Mon, 20 Jun 2022 11:57:20 -0400, Robert Haas wrote > in >> says "don't keep trying to read more WAL, just promote RIGHT HERE?". I >> think this logic would surely be incorrect in that case. It feels to >> me like the right t

Re: [BUG] Panic due to incorrect missingContrecPtr after promotion

2022-06-20 Thread Kyotaro Horiguchi
At Mon, 20 Jun 2022 11:57:20 -0400, Robert Haas wrote in > It seems to me that what we want to do is: if we're about to start > allowing WAL writes, then consider whether to insert an aborted > contrecord record. Now, if we are about to start allowing WAL writes, > we must determine the LSN at wh

Re: [BUG] Panic due to incorrect missingContrecPtr after promotion

2022-06-20 Thread Robert Haas
On Mon, Jun 20, 2022 at 7:28 AM Kyotaro Horiguchi wrote: > > Hmm. I have not looked at that in depth, but if the intention is to > > check that the database is able to write WAL, looking at > > XLogCtl->SharedRecoveryState would be the way to go because that's the > > flip switching between crash

Re: [BUG] Panic due to incorrect missingContrecPtr after promotion

2022-06-20 Thread Kyotaro Horiguchi
At Mon, 20 Jun 2022 16:13:43 +0900, Michael Paquier wrote in > On Fri, May 27, 2022 at 07:01:37PM +, Imseih (AWS), Sami wrote: > > What we found: > > > > 1. missingContrecPtr is set when > >StandbyMode is false, therefore > >only a writer should set this value > >and a record i

Re: [BUG] Panic due to incorrect missingContrecPtr after promotion

2022-06-20 Thread Michael Paquier
On Fri, May 27, 2022 at 07:01:37PM +, Imseih (AWS), Sami wrote: > What we found: > > 1. missingContrecPtr is set when >StandbyMode is false, therefore >only a writer should set this value >and a record is then sent downstream. > >But a standby going through crash >recove

Re: [BUG] Panic due to incorrect missingContrecPtr after promotion

2022-05-26 Thread Kyotaro Horiguchi
At Fri, 27 May 2022 02:01:27 +, "Imseih (AWS), Sami" wrote in > After further research, we found the following. > > Testing on 13.6 with the attached patch we see > that the missingContrecPtr is being incorrectly > set on the standby and the promote in the tap > test fails. > > Per the com

Re: [BUG] Panic due to incorrect missingContrecPtr after promotion

2022-05-26 Thread Kyotaro Horiguchi
At Thu, 26 May 2022 19:57:41 +, "Imseih (AWS), Sami" wrote in > We see another occurrence of this bug with the last patch applied in 13.7. > > After a promotion we observe the following in the logs: ... > We think it's because VerifyOverwriteContrecord was not > called which is why we see

Re: [BUG] Panic due to incorrect missingContrecPtr after promotion

2022-05-26 Thread Michael Paquier
On Fri, May 27, 2022 at 08:53:03AM +0900, Michael Paquier wrote: > This needs a very close lookup, I'll try to check all that except if > somebody beats me to it. Please ignore that. I need more coffee, and likely a break. -- Michael

Re: [BUG] Panic due to incorrect missingContrecPtr after promotion

2022-05-26 Thread Michael Paquier
On Tue, Feb 22, 2022 at 07:20:55PM +, Imseih (AWS), Sami wrote: > The overwrite_contrecord was introduced in 13.5 with > https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=ff9f111bce24. > > Attached is a patch and a TAP test to handle this condition. The > patch ensures that an ov

Re: [BUG] Panic due to incorrect missingContrecPtr after promotion

2022-05-26 Thread Imseih (AWS), Sami
We see another occurrence of this bug with the last patch applied in 13.7. After a promotion we observe the following in the logs: 2022-05-25 00:35:38 UTC::@:[371]:PANIC: xlog flush request 10/B1FA3D88 is not satisfied --- flushed only to 7/A860 2022-05-25 00:35:38 UTC:172.31.26.238(38610):
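For readers comparing the two positions in the PANIC line above: an LSN is printed as two hexadecimal halves separated by a slash, the high and low 32 bits of a 64-bit WAL position. The following is a minimal sketch of that comparison, not code from the thread. The helper name `parse_lsn` is hypothetical, and the flushed value is used exactly as it appears in this archived snippet, which may itself be abbreviated.

```python
def parse_lsn(text: str) -> int:
    """Convert a textual LSN such as '10/B1FA3D88' to a 64-bit integer.

    parse_lsn is a hypothetical helper for illustration only; it is not
    a PostgreSQL function.
    """
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


# Values copied verbatim from the log line in the archived message.
requested = parse_lsn("10/B1FA3D88")
flushed = parse_lsn("7/A860")

# The flush request points beyond what has actually been flushed.
print(requested > flushed)
```

Comparing the integers rather than the strings avoids mistakes when the two halves have different numbers of hex digits.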

Re: [BUG] Panic due to incorrect missingContrecPtr after promotion

2022-03-23 Thread alvhe...@alvh.no-ip.org
On 2022-Mar-07, Imseih (AWS), Sami wrote: > I have gone ahead and backpatched this all the way to 10 as well. Thanks! I pushed this now. I edited the test though: I don't understand why you went to the trouble of setting stuff in order to call 'pg_ctl promote' (in different ways for older branc

Re: [BUG] Panic due to incorrect missingContrecPtr after promotion

2022-02-24 Thread Imseih (AWS), Sami
>Nice catch! However, I'm not sure I like the patch. > * made it through and start writing after the portion that > persisted. > * (It's critical to first write an OVERWRITE_CONTRECORD message, > which > * we'll do as soon as we're open for writing new WA

Re: [BUG] Panic due to incorrect missingContrecPtr after promotion

2022-02-24 Thread Kyotaro Horiguchi
At Thu, 24 Feb 2022 16:26:42 +0900 (JST), Kyotaro Horiguchi wrote in > So, actually WAL did not end in an incomplete record. I think > FinishWalRecover is the last place to do that. (But it could be > earlier.) After some investigation, I finally concluded that we should reset abortedRecPtr

Re: [BUG] Panic due to incorrect missingContrecPtr after promotion

2022-02-23 Thread Kyotaro Horiguchi
At Wed, 23 Feb 2022 02:58:07 +, "Imseih (AWS), Sami" wrote in > > Ooh, nice find and diagnosis. I can confirm that the test fails as you > > described without the code fix, and doesn't fail with it. > > > I attach the same patch, with the test file put in its final place > > ra

Re: [BUG] Panic due to incorrect missingContrecPtr after promotion

2022-02-22 Thread Imseih (AWS), Sami
> Ooh, nice find and diagnosis. I can confirm that the test fails as you > described without the code fix, and doesn't fail with it. > I attach the same patch, with the test file put in its final place > rather than as a patch. Due to recent xlog.c changes this needs a bit of > wor

Re: [BUG] Panic due to incorrect missingContrecPtr after promotion

2022-02-22 Thread Alvaro Herrera
On 2022-Feb-22, Imseih (AWS), Sami wrote: > On 13.5 a WAL flush PANIC is encountered after a standby is promoted. > > With debugging, it was found that when a standby skips a missing > continuation record on recovery, the missingContrecPtr is not > invalidated after the record is skipped. Therefo

[BUG] Panic due to incorrect missingContrecPtr after promotion

2022-02-22 Thread Imseih (AWS), Sami
On 13.5 a wal flush PANIC is encountered after a standby is promoted. With debugging, it was found that when a standby skips a missing continuation record on recovery, the missingContrecPtr is not invalidated after the record is skipped. Therefore, when the standby is promoted to a primary it wr
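The condition described in this report can be illustrated with a toy model. This is only a sketch under stated assumptions, not PostgreSQL code: `RecoveryState`, `skip_missing_contrecord`, and `stale_at_promotion` are hypothetical names, the pointer is a plain integer, and the only behavior modeled is whether the pointer is invalidated after the skip.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RecoveryState:
    # Hypothetical stand-in for missingContrecPtr; None means "invalid".
    missing_contrec_ptr: Optional[int] = None


def skip_missing_contrecord(state: RecoveryState, invalidate: bool) -> None:
    """Model a standby skipping a missing continuation record in recovery.

    invalidate=False mimics the reported behavior (pointer left set);
    invalidate=True mimics clearing the pointer once the skip is done.
    """
    if invalidate:
        state.missing_contrec_ptr = None


def stale_at_promotion(state: RecoveryState) -> bool:
    """True if a promoted server would still see a leftover pointer."""
    return state.missing_contrec_ptr is not None


reported = RecoveryState(missing_contrec_ptr=0x1000)  # arbitrary value
skip_missing_contrecord(reported, invalidate=False)

cleared = RecoveryState(missing_contrec_ptr=0x1000)
skip_missing_contrecord(cleared, invalidate=True)

print(stale_at_promotion(reported), stale_at_promotion(cleared))
```

In the first case the pointer survives the skip and is still set when the standby is promoted, which is the state the report identifies as the trigger. What the server then does with it is outside this sketch.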