Hi,
I inserted the following code at walsender.c:2509 (v15.8) to reproduce the
issue:

    {
        WalSnd *walsnd = MyWalSnd;

        SpinLockAcquire(&walsnd->mutex);
        if (walsnd->flush % wal_segment_size == 0 &&
            walsnd->sentPtr == walsnd->flush &&
            walsnd->flush > wal_segment_size)
        {
            for (int i = 0; i < max_wal_senders; i++)
            {
                WalSnd *walsnder = &WalSndCtl->walsnds[i];

                /* Skip our own slot. */
                if (walsnder->pid == walsnd->pid)
                    continue;

                SpinLockAcquire(&walsnder->mutex);

                /* Skip unused slots. */
                if (walsnder->pid == 0)
                {
                    SpinLockRelease(&walsnder->mutex);
                    continue;
                }

                if (walsnder->flush % wal_segment_size == 0 &&
                    walsnder->flush > wal_segment_size)
                    elog(PANIC, "simulate primary crashed, flushedlsn %X/%X",
                         LSN_FORMAT_ARGS(walsnder->flush));

                SpinLockRelease(&walsnder->mutex);
            }
        }
        SpinLockRelease(&walsnd->mutex);
    }

Regards,
Pixian Shi

px shi <spxlyy...@gmail.com> wrote on Thu, Oct 10, 2024 at 15:01:

>> Although I've not tested it because I don't have a good way to reproduce
>> the problem
>
> I use GDB to reproduce the issue by killing the primary node with kill -9
> when the standby's flushedlsn is at the beginning of a WAL segment.
>
>> However, this causes the standby to request the primary to stream data
>> from the requested location again, even if the standby already has valid
>> data. It could be wasteful, for example, when applying a WAL record is
>> delayed due to a conflict with recovery or recovery_min_apply_delay.
>
> I had overlooked this problem, thanks for pointing that out.
>
> I think that comparing walrcv->receiveStart with recptr would be a better
> approach. Specifically, if walrcv->receiveStart is greater than recptr,
> then reset walrcv->flushedUpto. This way, we can prevent fetching WAL data
> from the primary when the standby already has valid data. A new patch has
> been submitted.
>
> Regards,
>
> Pixian Shi
>
> Yugo NAGATA <nag...@sraoss.co.jp> wrote on Wed, Oct 9, 2024 at 14:46:
>
>> On Mon, 30 Sep 2024 15:14:54 +0800
>> px shi <spxlyy...@gmail.com> wrote:
>>
>> > Thanks for responding.
>> > > It is odd that the standby server crashes when replication fails
>> > > because the standby would keep retrying to get the next record even
>> > > in such a case.
>> >
>> > As I mentioned earlier, when replication fails, the standby retries to
>> > establish streaming replication. At this point, the value of
>> > walrcv->flushedUpto is not necessarily the data actually flushed to
>> > disk. However, the startup process mistakenly believes that the latest
>> > flushed LSN is walrcv->flushedUpto and attempts to open the
>> > corresponding WAL file, which doesn't exist, leading to a file open
>> > failure and causing the startup process to PANIC.
>>
>> Although I've not tested it because I don't have a good way to reproduce
>> the problem, the patch seems to fix the issue by always rewinding
>> walrcv->flushedUpto to the requested WAL location when streaming is
>> requested to start. However, this causes the standby to request the
>> primary to stream data from the requested location again, even if the
>> standby already has valid data. It could be wasteful, for example, when
>> applying a WAL record is delayed due to a conflict with recovery or
>> recovery_min_apply_delay.
>>
>> It might be better if the standby could notice that the requested record
>> is not in the primary's timeline before requesting streaming, but I don't
>> have a good solution for now.
>>
>> Regards,
>> Yugo Nagata
>>
>> > Regards,
>> > Pixian Shi
>> >
>> > Yugo NAGATA <nag...@sraoss.co.jp> wrote on Mon, Sep 30, 2024 at 13:47:
>> >
>> > > On Wed, 21 Aug 2024 09:11:03 +0800
>> > > px shi <spxlyy...@gmail.com> wrote:
>> > >
>> > > > Yugo Nagata <nag...@sraoss.co.jp> wrote on Wed, Aug 21, 2024 at
>> > > > 00:49:
>> > > >
>> > > > > Is s1 a cascading standby of s2? If, instead, s1 and s2 are each
>> > > > > standbys of the primary server, it is not surprising that s2 has
>> > > > > progressed further than s1 when the primary fails.
>> > > > > I believe that this is the case where you should use pg_rewind.
>> > > > > Even if flushedUpto is reset as proposed in your patch, s2 might
>> > > > > already have applied a WAL record that s1 has not processed yet,
>> > > > > and there would be no guarantee that subsequent applies succeed.
>> > > >
>> > > > Thank you for your response. In my scenario, s1 and s2 are each
>> > > > standbys of the primary server; s1 is a synchronous standby and s2
>> > > > is an asynchronous standby. You mentioned that if s2's replay
>> > > > progress is ahead of s1's, pg_rewind should be used. However, what
>> > > > I'm trying to address is an issue where s2 crashes during replay
>> > > > after s1 has been promoted to primary, even though s2's progress
>> > > > hasn't surpassed s1's.
>> > >
>> > > I understood your point. It is odd that the standby server crashes
>> > > when replication fails because the standby would keep retrying to get
>> > > the next record even in such a case.
>> > >
>> > > Regards,
>> > > Yugo Nagata
>> > >
>> > > > Regards,
>> > > > Pixian Shi
>> > >
>> > > --
>> > > Yugo NAGATA <nag...@sraoss.co.jp>
>>
>> --
>> Yugo NAGATA <nag...@sraoss.co.jp>
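
P.S. For readers following along, here is a minimal, self-contained sketch of
the rewind condition proposed earlier in the thread (reset walrcv->flushedUpto
only when walrcv->receiveStart is greater than the requested recptr). This is
not the actual patch: the struct and function names below are simplified
stand-ins for the real walreceiver shared state, and XLogRecPtr is modeled as
a plain 64-bit integer.

    #include <assert.h>
    #include <stdint.h>

    typedef uint64_t XLogRecPtr;    /* stand-in for PostgreSQL's XLogRecPtr */

    /* Simplified stand-in for the relevant WalRcvData fields. */
    typedef struct
    {
        XLogRecPtr receiveStart;    /* where streaming was last asked to start */
        XLogRecPtr flushedUpto;     /* latest LSN believed flushed to disk */
    } FakeWalRcvData;

    /*
     * Sketch of the proposed check when streaming is requested to start at
     * recptr: if the previous receive start point lies beyond the newly
     * requested location, flushedUpto may describe WAL that was never
     * actually flushed locally, so rewind it to the requested LSN.
     * Otherwise leave flushedUpto alone, so the standby does not re-fetch
     * WAL it already holds.
     */
    void
    request_stream_start(FakeWalRcvData *walrcv, XLogRecPtr recptr)
    {
        if (walrcv->receiveStart > recptr)
            walrcv->flushedUpto = recptr;
        walrcv->receiveStart = recptr;
    }

The point of the comparison is that flushedUpto is only rewound in the
suspect case, rather than unconditionally on every streaming start as in the
first version of the patch.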