>> For the benefit of anyone who may be looking at this thread in the
>> archive later, I believe this commit will have fixed this issue:
I can also confirm that indeed the commit
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=6672d79
does fix this issue.
Thanks!
--
On Mon, Aug 29, 2022 at 12:54 PM Robert Haas wrote:
> On Wed, Aug 10, 2022 at 4:37 AM Kyotaro Horiguchi
> wrote:
> > So, it seems that the *standby* received the inconsistent WAL stream
> > (aborted-contrecord not followed by an overwriting-missing-contrecord)
> > from the primary. Thus the incon
On Wed, Aug 10, 2022 at 4:37 AM Kyotaro Horiguchi
wrote:
> So, it seems that the *standby* received the inconsistent WAL stream
> (aborted-contrecord not followed by an overwriting-missing-contrecord)
> from the primary. Thus the inconsistency happened on the primary, not
> on the standby.
>
> So.
Hello.
> Yes, that is correct.
Mmm. I believed that the log came from a single server run, since the
PID (I believe [359] and [357] are PIDs) did not change through the
log lines.
> 2022-08-05 18:50:13 UTC::@:[359]:LOG: creating missing WAL directory
> "pg_wal/archive_status"
This means that
> The server seems to have started as a standby after crashing a
> primary. Is it correct?
Yes, that is correct.
2022-08-05 17:18:51 UTC::@:[359]:LOG: database system was interrupted; last
known up at 2022-08-05 17:08:52 UTC
2022-08-05 17:18:51 UTC::@:[359]:LOG: creating missing WAL directory
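For anyone reading this in the archive, the PID comparison discussed here can be sketched in a few lines of Python. This is only an illustration: the line layout (timestamp, "UTC", remote host, user@database, [pid], severity) is inferred from the log excerpts quoted in this thread, and the names LOG_LINE and pids_in are made up for the example.

```python
import re

# Layout assumed from the excerpts in this thread:
#   <timestamp> UTC:<remote>:<user>@<db>:[<pid>]:<LEVEL>: <message>
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) UTC:"
    r"(?P<remote>[^:]*):(?P<user>[^@]*)@(?P<db>[^:]*):"
    r"\[(?P<pid>\d+)\]:(?P<level>[A-Z]+):\s*(?P<msg>.*)$"
)


def pids_in(lines):
    """Return the set of PIDs found in log lines that match the assumed layout."""
    pids = set()
    for line in lines:
        match = LOG_LINE.match(line)
        if match:  # lines in any other format are silently skipped
            pids.add(int(match.group("pid")))
    return pids


sample = [
    "2022-08-05 17:18:51 UTC::@:[359]:LOG: database system was interrupted",
    "2022-08-05 18:50:13 UTC::@:[359]:LOG: creating missing WAL directory",
]
# A single PID across all lines suggests, but does not prove, one server run.
print(pids_in(sample))
```

If the set holds more than one PID, the lines came from more than one process, which is the distinction the reply above is trying to draw.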
At Fri, 5 Aug 2022 21:28:16 +0000, "Imseih (AWS), Sami"
wrote in
> > Would you mind trying the second attached to obtain detailed log on
> > your testing environment? With the patch, the modified TAP test yields
> > the log lines like below.
>
> I applied the logging patch to 13.7 ( attached is
> Any luck with this?
Apologies for the delay, as I have been away.
I will test this next week and report back my findings.
Thanks
Sami Imseih
Amazon Web Services (AWS)
On 6/28/22, 2:10 AM, "Kyotaro Horiguchi" wrote:
Hello,
On 2022-Jun-29, Imseih (AWS), Sami wrote:
> > Would you mind trying the second attached to obtain detailed log on
> > your testing environment? With the patch, the modified TAP test yields
> > the log lines like below.
>
> Thanks for this. I will apply this to the testing environment and
> Would you mind trying the second attached to obtain detailed log on
> your testing environment? With the patch, the modified TAP test yields
> the log lines like below.
Thanks for this. I will apply this to the testing environment and
will share the output.
Regards,
Sami Imseih
Amazon Web Serv
I'd like to look into the WAL segments related to the failure.
Mmm... With the patch, xlogreader->abortedRecPtr is valid if and only if
the last record that failed to be read was an aborted contrec. If
recovery ends here, the first inserted record is an "aborted contrec"
record. I still see it as the
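The invariant described above can be illustrated with a toy model in Python. This is not PostgreSQL code: ToyReader, its read() method, the outcome strings and first_record_after_recovery are all hypothetical names chosen for the sketch, and the only behavior modeled is "the pointer is set if and only if the most recent read failed on an aborted contrecord".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyReader:
    """Toy stand-in for a WAL reader; models only the aborted-record pointer."""
    aborted_rec_ptr: Optional[int] = None

    def read(self, lsn: int, outcome: str) -> bool:
        """outcome is 'ok', 'aborted_contrecord' or 'other_failure' (made-up labels)."""
        # Cleared on every read attempt, so it can only describe the last read.
        self.aborted_rec_ptr = None
        if outcome == "aborted_contrecord":
            self.aborted_rec_ptr = lsn
        return outcome == "ok"


def first_record_after_recovery(reader: ToyReader) -> str:
    """What the toy server would insert first if recovery ended right now."""
    if reader.aborted_rec_ptr is not None:
        return "aborted-contrecord marker"
    return "ordinary record"


reader = ToyReader()
reader.read(0x1000, "ok")
reader.read(0x2000, "aborted_contrecord")
print(first_record_after_recovery(reader))
```

Note that in this model any later read, including a re-read of a record already known to be complete, clears the pointer again.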
At Mon, 27 Jun 2022 15:02:11 +0900, Michael Paquier wrote
in
> On Fri, Jun 24, 2022 at 04:17:34PM +0000, Imseih (AWS), Sami wrote:
> > It has been difficult to get a generic repro, but the way we reproduce
> > is through our test suite. To give more details, we are running tests
> > in which we c
On Fri, Jun 24, 2022 at 04:17:34PM +0000, Imseih (AWS), Sami wrote:
> It has been difficult to get a generic repro, but the way we reproduce
> is through our test suite. To give more details, we are running tests
> in which we constantly failover and promote standbys. The issue
> surfaces after we h
>Thus, I still don't see what happened on Imseih's side, but I can
>cause a PANIC with somewhat tricky steps, which I don't think are valid.
>This is why I wanted to know the exact steps that cause the PANIC.
>The attached 1 is the PoC of the TAP test (it uses system()..), and
>the
On Mon, Jun 20, 2022 at 9:35 PM Kyotaro Horiguchi
wrote:
> Unfortunately it doesn't work because we read a record already known
> to be complete again at the end of recovery. That is the reason for
> "abortedRecPtr < xlogreader->EndRecPtr" in my PoC patch. Without it,
> abortedRecPtr is erased when
On Tue, Jun 21, 2022 at 10:35:33AM +0900, Kyotaro Horiguchi wrote:
> At Mon, 20 Jun 2022 11:57:20 -0400, Robert Haas wrote
> in
>> says "don't keep trying to read more WAL, just promote RIGHT HERE?". I
>> think this logic would surely be incorrect in that case. It feels to
>> me like the right t
At Mon, 20 Jun 2022 11:57:20 -0400, Robert Haas wrote
in
> It seems to me that what we want to do is: if we're about to start
> allowing WAL writes, then consider whether to insert an aborted
> contrecord record. Now, if we are about to start allowing WAL writes,
> we must determine the LSN at wh
On Mon, Jun 20, 2022 at 7:28 AM Kyotaro Horiguchi
wrote:
> > Hmm. I have not looked at that in depth, but if the intention is to
> > check that the database is able to write WAL, looking at
> > XLogCtl->SharedRecoveryState would be the way to go because that's the
> > flip switching between crash
At Mon, 20 Jun 2022 16:13:43 +0900, Michael Paquier wrote
in
> On Fri, May 27, 2022 at 07:01:37PM +0000, Imseih (AWS), Sami wrote:
> > What we found:
> >
> > 1. missingContrecPtr is set when
> >    StandbyMode is false, therefore
> >    only a writer should set this value
> >    and a record i
On Fri, May 27, 2022 at 07:01:37PM +0000, Imseih (AWS), Sami wrote:
> What we found:
>
> 1. missingContrecPtr is set when
>    StandbyMode is false, therefore
>    only a writer should set this value
>    and a record is then sent downstream.
>
>    But a standby going through crash
>    recove
At Fri, 27 May 2022 02:01:27 +0000, "Imseih (AWS), Sami"
wrote in
> After further research, we found the following.
>
> Testing on 13.6 with the attached patch we see
> that the missingContrecPtr is being incorrectly
> set on the standby and the promote in the tap
> test fails.
>
> Per the com
At Thu, 26 May 2022 19:57:41 +0000, "Imseih (AWS), Sami"
wrote in
> We see another occurrence of this bug with the last patch applied in 13.7.
>
> After a promotion we observe the following in the logs:
...
> We think it's because VerifyOverwriteContrecord was not
> called which is why we see
On Fri, May 27, 2022 at 08:53:03AM +0900, Michael Paquier wrote:
> This needs a very close look; I'll try to check all that unless
> somebody beats me to it.
Please ignore that.. I need more coffee, and likely a break.
--
Michael
On Tue, Feb 22, 2022 at 07:20:55PM +0000, Imseih (AWS), Sami wrote:
> The overwrite_contrecord was introduced in 13.5 with
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=ff9f111bce24.
>
> Attached is a patch and a TAP test to handle this condition. The
> patch ensures that an ov
We see another occurrence of this bug with the last patch applied in 13.7.
After a promotion we observe the following in the logs:
2022-05-25 00:35:38 UTC::@:[371]:PANIC: xlog flush request 10/B1FA3D88 is not
satisfied --- flushed only to 7/A860
2022-05-25 00:35:38
UTC:172.31.26.238(38610):
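To make the PANIC message above easier to read: an LSN is printed as two hexadecimal numbers separated by a slash, the high and low 32 bits of a 64-bit WAL position. The sketch below (helper names parse_lsn and flush_satisfied are made up for this example) compares the two positions exactly as they appear in the excerpt. The "flushed only to" value is copied as printed and may be incomplete in this archive view, but the high halves alone (0x7 versus 0x10) already decide the comparison.

```python
def parse_lsn(text: str) -> int:
    """Convert an LSN printed as 'HI/LO' (both hexadecimal) to a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def flush_satisfied(requested: str, flushed: str) -> bool:
    """A flush request is satisfied only if WAL is flushed at least that far."""
    return parse_lsn(flushed) >= parse_lsn(requested)


requested = "10/B1FA3D88"  # from the PANIC line in this thread
flushed = "7/A860"         # as printed in the excerpt; possibly truncated

# The request points far beyond what has actually been flushed.
print(flush_satisfied(requested, flushed))
```

A request that lies ahead of the flushed position by several units in the high word is consistent with a stale or otherwise bogus LSN being handed to the flush code, which is what the thread goes on to investigate.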
On 2022-Mar-07, Imseih (AWS), Sami wrote:
> I have gone ahead and backpatched this all the way to 10 as well.
Thanks! I pushed this now. I edited the test though: I don't
understand why you went to the trouble of setting stuff in order to call
'pg_ctl promote' (in different ways for older branc
>Nice catch! However, I'm not sure I like the patch.
> * made it through and start writing after the portion that
> persisted.
> * (It's critical to first write an OVERWRITE_CONTRECORD message,
> which
> * we'll do as soon as we're open for writing new WA
At Thu, 24 Feb 2022 16:26:42 +0900 (JST), Kyotaro Horiguchi
wrote in
> So, actually WAL did not end in an incomplete record. I think
> FinishWalRecovery is the last place to do that. (But it could be
> earlier.)
After some investigation, I finally concluded that we should reset
abortedRecPtr
At Wed, 23 Feb 2022 02:58:07 +0000, "Imseih (AWS), Sami"
wrote in
> >Ooh, nice find and diagnosis. I can confirm that the test fails as you
> >described without the code fix, and doesn't fail with it.
>
> >I attach the same patch, with the test file put in its final place
> >ra
>Ooh, nice find and diagnosis. I can confirm that the test fails as you
>described without the code fix, and doesn't fail with it.
>I attach the same patch, with the test file put in its final place
>rather than as a patch. Due to recent xlog.c changes this needs a bit of
>wor
On 2022-Feb-22, Imseih (AWS), Sami wrote:
> On 13.5 a WAL flush PANIC is encountered after a standby is promoted.
>
> With debugging, it was found that when a standby skips a missing
> continuation record on recovery, the missingContrecPtr is not
> invalidated after the record is skipped. Therefo
On 13.5 a WAL flush PANIC is encountered after a standby is promoted.
With debugging, it was found that when a standby skips a missing continuation
record on recovery, the missingContrecPtr is not invalidated after the record
is skipped. Therefore, when the standby is promoted to a primary it wr
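The stale-pointer condition described in this report can be pictured with a small toy model in Python. It is not PostgreSQL code: ToyRecovery and its methods are hypothetical, and the flag invalidate_after_skip simply switches between the reported behavior and the behavior one would expect. What the promoted server then does with the leftover value is cut off in the excerpt above, so the sketch stops at showing that the pointer survives the skip.

```python
from typing import Optional


class ToyRecovery:
    """Toy model of a standby tracking a missing continuation record."""

    def __init__(self, invalidate_after_skip: bool) -> None:
        self.invalidate_after_skip = invalidate_after_skip
        self.missing_contrec_ptr: Optional[int] = None

    def note_missing_contrecord(self, lsn: int) -> None:
        # Recovery notices that the continuation of a record is missing.
        self.missing_contrec_ptr = lsn

    def skip_missing_contrecord(self) -> None:
        # The record is skipped; only the "expected" variant clears the pointer.
        if self.invalidate_after_skip:
            self.missing_contrec_ptr = None

    def pointer_at_promotion(self) -> Optional[int]:
        # Whatever is still stored here is what promotion would get to see.
        return self.missing_contrec_ptr


reported = ToyRecovery(invalidate_after_skip=False)
reported.note_missing_contrecord(0x1000)
reported.skip_missing_contrecord()

expected = ToyRecovery(invalidate_after_skip=True)
expected.note_missing_contrecord(0x1000)
expected.skip_missing_contrecord()

print(reported.pointer_at_promotion(), expected.pointer_at_promotion())
```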