Re: Infinite loop in XLogPageRead() on standby

2025-01-19 Thread Michael Paquier
On Thu, Jan 16, 2025 at 09:42:49AM +0900, Michael Paquier wrote: > I've applied the first refactoring bits down to v13 (see for example a > s/emit_message/emit_wal/ tweaked for consistency, with more comment > tweaks). Attached are patches for each branch for the bug fix, that > I'm still testing

Re: Infinite loop in XLogPageRead() on standby

2025-01-15 Thread Michael Paquier
On Wed, Jan 15, 2025 at 10:35:42AM +0100, Alexander Kukushkin wrote: > Thank you for picking it up. I briefly looked at both patches. The actual > fix in XLogPageRead() looks good to me. > I also agree with suggested refactoring, where there is certainly some room > for improvement - $WAL_SEGMENT_

Re: Infinite loop in XLogPageRead() on standby

2025-01-15 Thread Alexander Kukushkin
Hi Michael, On Wed, 15 Jan 2025 at 05:45, Michael Paquier wrote: > The new regression test is something I really want to keep around, > to be able to emulate the infinite loop, but I got annoyed with the > amount of duplication between the new test and the existing > 039_end_of_wal.pl as there

Re: Infinite loop in XLogPageRead() on standby

2025-01-14 Thread Michael Paquier
On Wed, Dec 25, 2024 at 12:00:59PM +0900, Michael Paquier wrote: > All of them refer to an infinite loop reachable in the startup process > when we read an incorrect incomplete record just after a failover or > when a WAL receiver restarts. Not sure which way is best in order to > fix all of them

Re: Infinite loop in XLogPageRead() on standby

2025-01-09 Thread Michael Paquier
On Fri, Mar 01, 2024 at 01:16:37PM +0900, Kyotaro Horiguchi wrote: > This code intends to prevent a page header error from causing a record > reread, when a record is required to be read from multiple sources. We > could restrict this to only fire at segment boundaries. At segment > boundaries, we

Re: Infinite loop in XLogPageRead() on standby

2024-12-24 Thread Michael Paquier
On Wed, Nov 13, 2024 at 02:18:06PM +0100, Alexander Kukushkin wrote: > Now that v17 is released and before v18 feature freeze we have a few > months, I hope you will find some time to look at it. My apologies for taking a couple of weeks before coming back to this thread. I have been informed a c

Re: Infinite loop in XLogPageRead() on standby

2024-11-13 Thread Alexander Kukushkin
Hi Michael, Now that v17 is released and before v18 feature freeze we have a few months, I hope you will find some time to look at it. On Wed, 5 Jun 2024 at 07:09, Michael Paquier wrote: > On Tue, Jun 04, 2024 at 04:16:43PM +0200, Alexander Kukushkin wrote: > > Now that beta1 was released I hop

Re: Infinite loop in XLogPageRead() on standby

2024-06-04 Thread Michael Paquier
On Tue, Jun 04, 2024 at 04:16:43PM +0200, Alexander Kukushkin wrote: > Now that beta1 was released I hope you are not so busy and hence would like > to follow up on this problem. I am still working on something for the v18 cycle that I'd like to present before the beginning of the next commit fest

Re: Infinite loop in XLogPageRead() on standby

2024-06-04 Thread Alexander Kukushkin
Hi Michael and Kyotaro, Now that beta1 was released I hope you are not so busy and hence would like to follow up on this problem. Regards, -- Alexander Kukushkin

Re: Infinite loop in XLogPageRead() on standby

2024-03-15 Thread Ants Aasma
On Wed, 13 Mar 2024 at 04:56, Kyotaro Horiguchi wrote: > > At Mon, 11 Mar 2024 16:43:32 +0900 (JST), Kyotaro Horiguchi > wrote in > > Oh, I once saw the fix work, but seems not to be working after some > > point. The new issue was a corruption of received WAL records on the > > first standby, an

Re: Infinite loop in XLogPageRead() on standby

2024-03-15 Thread Alexander Kukushkin
Hi Kyotaro, On Wed, 13 Mar 2024 at 03:56, Kyotaro Horiguchi wrote: > I identified the cause of the second issue. When I tried to replay the > issue, the second standby accidentally received the old timeline's > last page-spanning record till the end while the first standby was > promoting (but it

Re: Infinite loop in XLogPageRead() on standby

2024-03-12 Thread Kyotaro Horiguchi
At Mon, 11 Mar 2024 16:43:32 +0900 (JST), Kyotaro Horiguchi wrote in > Oh, I once saw the fix work, but seems not to be working after some > point. The new issue was a corruption of received WAL records on the > first standby, and it may be related to the setting. I identified the cause of the

Re: Infinite loop in XLogPageRead() on standby

2024-03-11 Thread Michael Paquier
On Mon, Mar 11, 2024 at 04:43:32PM +0900, Kyotaro Horiguchi wrote: > At Wed, 6 Mar 2024 11:34:29 +0100, Alexander Kukushkin > wrote in >> Thank you for spending your time on it! > > You're welcome, but I apologize for the delay in the work. Thanks for spending time on this. Everybody is busy

Re: Infinite loop in XLogPageRead() on standby

2024-03-11 Thread Kyotaro Horiguchi
At Wed, 6 Mar 2024 11:34:29 +0100, Alexander Kukushkin wrote in > Hmm, I think you meant to use wal_segment_size, because 0x100000 is just > 1MB. As a result, currently it works for you by accident. Oh, I once saw the fix work, but seems not to be working after some point. The new issue was a c
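
The size remark above can be illustrated with a small sketch. A hard-coded 1 MB modulus (0x100000) also matches offsets in the middle of a default 16 MB WAL segment, which is why a check written that way could appear to work by accident. The code below is only an illustration under stated assumptions: `lsn_t`, `DEFAULT_WAL_SEGMENT_SIZE` and `is_segment_start()` are hypothetical names, not PostgreSQL's own, and the segment size is assumed to be a power of two.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a 64-bit WAL position (not PostgreSQL's type). */
typedef uint64_t lsn_t;

/* Assumed default segment size of 16 MB; a cluster may use another power of two. */
#define DEFAULT_WAL_SEGMENT_SIZE (16u * 1024u * 1024u)

/*
 * Return true when lsn is the first byte of a WAL segment of the given size.
 * Hypothetical helper for illustration only.
 */
static bool
is_segment_start(lsn_t lsn, uint32_t wal_segment_size)
{
    /* The mask trick below is only valid for power-of-two sizes. */
    assert(wal_segment_size != 0 &&
           (wal_segment_size & (wal_segment_size - 1)) == 0);

    return (lsn & (lsn_t) (wal_segment_size - 1)) == 0;
}
```

With these definitions, position 0x300000 counts as a boundary for a 1 MB modulus but not for a 16 MB segment, so the two checks disagree on most 1 MB multiples.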

Re: Infinite loop in XLogPageRead() on standby

2024-03-06 Thread Alexander Kukushkin
Hi Kyotaro, Oh, now I understand what you mean. Is the retry supposed to happen only when we are reading the very first page from the WAL file? On Wed, 6 Mar 2024 at 09:57, Kyotaro Horiguchi wrote: > > xlogrecovery.c: > @@ -3460,8 +3490,10 @@ retry: > * responsible for the validation.

Re: Infinite loop in XLogPageRead() on standby

2024-03-06 Thread Kyotaro Horiguchi
At Tue, 5 Mar 2024 09:36:44 +0100, Alexander Kukushkin wrote in > Please find attached the patch fixing the problem and the updated TAP test > that addresses Nit. Record-level retries happen when the upper layer detects errors. In my previous mail, I cited code that is intended to prevent this

Re: Infinite loop in XLogPageRead() on standby

2024-03-05 Thread Alexander Kukushkin
Hello Michael, Kyotaro, Please find attached the patch fixing the problem and the updated TAP test that addresses Nit. -- Regards, -- Alexander Kukushkin 042_no_contrecord_switch.pl Description: Perl program diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xl

Re: Infinite loop in XLogPageRead() on standby

2024-02-29 Thread Kyotaro Horiguchi
At Fri, 01 Mar 2024 12:37:55 +0900 (JST), Kyotaro Horiguchi wrote in > Anyway, our current policy here is to avoid record-rereads beyond > source switches. However, fixing this seems to require that source > switches cause record rereads unless some additional information is > available to know

Re: Infinite loop in XLogPageRead() on standby

2024-02-29 Thread Kyotaro Horiguchi
At Fri, 01 Mar 2024 12:04:31 +0900 (JST), Kyotaro Horiguchi wrote in > At Fri, 01 Mar 2024 10:29:12 +0900 (JST), Kyotaro Horiguchi > wrote in > > After reading this, I came up with a possibility that walreceiver > > recovers more quickly than the calling interval to > > WaitForWALtoBecomeAvai

Re: Infinite loop in XLogPageRead() on standby

2024-02-29 Thread Kyotaro Horiguchi
At Fri, 01 Mar 2024 10:29:12 +0900 (JST), Kyotaro Horiguchi wrote in > After reading this, I came up with a possibility that walreceiver > recovers more quickly than the calling interval to > WaitForWALtoBecomeAvailable(). If walreceiver disconnects after a call > to the function WaitForWAL...()

Re: Infinite loop in XLogPageRead() on standby

2024-02-29 Thread Kyotaro Horiguchi
At Fri, 1 Mar 2024 08:17:04 +0900, Michael Paquier wrote in > On Thu, Feb 29, 2024 at 05:44:25PM +0100, Alexander Kukushkin wrote: > > On Thu, 29 Feb 2024 at 08:18, Kyotaro Horiguchi > > wrote: > >> In the first place, it's important to note that we do not guarantee > >> that an async standby c

Re: Infinite loop in XLogPageRead() on standby

2024-02-29 Thread Michael Paquier
On Thu, Feb 29, 2024 at 05:44:25PM +0100, Alexander Kukushkin wrote: > On Thu, 29 Feb 2024 at 08:18, Kyotaro Horiguchi > wrote: >> In the first place, it's important to note that we do not guarantee >> that an async standby can always switch its replication connection to >> the old primary or anot

Re: Infinite loop in XLogPageRead() on standby

2024-02-29 Thread Alexander Kukushkin
Hi Kyotaro, On Thu, 29 Feb 2024 at 08:18, Kyotaro Horiguchi wrote: > In the first place, it's important to note that we do not guarantee > that an async standby can always switch its replication connection to > the old primary or another sibling standby. This is due to the > variations in replicat

Re: Infinite loop in XLogPageRead() on standby

2024-02-29 Thread Alexander Kukushkin
Hi Michael, On Thu, 29 Feb 2024 at 06:05, Michael Paquier wrote: > > Wow. Have you seen that in an actual production environment? > Yes, we see it regularly, and it is reproducible in test environments as well. > my $start_page = start_of_page($end_lsn); > my $wal_file = write_wal($primary,
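
The body of the `start_of_page()` helper quoted above is not shown in this excerpt. Presumably it rounds a WAL position down to the beginning of the page that contains it. A minimal C sketch of that arithmetic follows, assuming the default 8 kB WAL page size; `lsn_t`, `WAL_PAGE_SIZE`, `wal_page_start()` and `wal_page_offset()` are hypothetical names chosen for illustration, not the test's or PostgreSQL's code.

```c
#include <stdint.h>

/* Hypothetical stand-in for a 64-bit WAL position. */
typedef uint64_t lsn_t;

/* Assumed WAL page size (8 kB is the usual default). */
#define WAL_PAGE_SIZE 8192u

/* Round lsn down to the first byte of the page that contains it. */
static lsn_t
wal_page_start(lsn_t lsn)
{
    return lsn - (lsn % WAL_PAGE_SIZE);
}

/* Byte offset of lsn within its page. */
static uint32_t
wal_page_offset(lsn_t lsn)
{
    return (uint32_t) (lsn % WAL_PAGE_SIZE);
}
```

A record whose start offset plus length exceeds the page size continues on the next page, which is the page-spanning situation the thread discusses.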

Re: Infinite loop in XLogPageRead() on standby

2024-02-28 Thread Kyotaro Horiguchi
At Thu, 29 Feb 2024 14:05:15 +0900, Michael Paquier wrote in > On Wed, Feb 28, 2024 at 11:19:41AM +0100, Alexander Kukushkin wrote: > > I spent some time debugging an issue with standby not being able to > > continue streaming after failover. > > > > The problem happens when standbys received on

Re: Infinite loop in XLogPageRead() on standby

2024-02-28 Thread Michael Paquier
On Wed, Feb 28, 2024 at 11:19:41AM +0100, Alexander Kukushkin wrote: > I spent some time debugging an issue with standby not being able to > continue streaming after failover. > > The problem happens when standbys received only the first part of the WAL > record that spans multiple pages. > In this

Infinite loop in XLogPageRead() on standby

2024-02-28 Thread Alexander Kukushkin
Hello hackers, I spent some time debugging an issue with standby not being able to continue streaming after failover. The problem manifests itself by following messages in the log: LOG: received SIGHUP, reloading configuration files LOG: parameter "primary_conninfo" changed to "port=58669 host=