On 2021-Jul-30, Bossart, Nathan wrote:

> On 7/30/21, 11:34 AM, "Alvaro Herrera" <alvhe...@alvh.no-ip.org> wrote:
> > Hmm ... I'm not sure we're prepared to backpatch this kind of change.
> > It seems a bit too disruptive to how replay works.  I think we
> > should be focusing solely on patch 0001 to surgically fix the precise
> > bug you see.  Does patch 0002 exist because you think that a system with
> > only 0001 will not correctly deal with a crash at the right time?
>
> Yes, that was what I was worried about.  However, I just performed a
> variety of tests with just 0001 applied, and I am beginning to suspect
> my concerns were unfounded.  With wal_buffers set very high,
> synchronous_commit set to off, and a long sleep at the end of
> XLogWrite(), I can reliably cause the archive status files to lag far
> behind the current open WAL segment.  However, even if I crash at this
> time, the .ready files are created when the server restarts (albeit
> out of order).  This appears to be due to the call to
> XLogArchiveCheckDone() in RemoveOldXlogFiles().  Therefore, we can
> likely abandon 0002.

That's great to hear.  I'll give 0001 a look again.

> > Now, the reason I'm looking at this patch series is that we're seeing a
> > related problem with walsender/walreceiver, which apparently are capable
> > of creating a file in the replica that ends up not existing in the
> > primary after a crash, for a reason closely related to what you
> > describe for WAL archival.  I'm not sure what is going on just yet, so
> > I'm not going to try and explain because I'm likely to get it wrong.
>
> I've suspected that this is due to the use of the flushed location for
> the send pointer, which AFAICT needn't align with a WAL record
> boundary.

Yeah, I had gotten as far as the GetFlushRecPtr but haven't tracked down
what happens with a contrecord.

-- 
Álvaro Herrera              PostgreSQL Developer  —  https://www.EnterpriseDB.com/
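
For readers following the thread, the suspicion quoted above (a flushed location that need not align with a WAL record boundary) can be illustrated with a toy model. This is only a sketch, not PostgreSQL code: the names WalPos, last_record_boundary and flushed_splits_record are hypothetical, and it assumes records are laid end to end from offset 0 with no page or segment headers.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical byte position in a simplified WAL stream (not XLogRecPtr). */
typedef uint64_t WalPos;

/*
 * Return the end position of the last record that is complete at or
 * before `flushed`.  rec_len[] holds the length of each record in order.
 */
WalPos last_record_boundary(const uint32_t *rec_len, size_t nrecs, WalPos flushed)
{
    WalPos end = 0;

    for (size_t i = 0; i < nrecs; i++) {
        if (end + rec_len[i] > flushed)
            break;              /* this record is only partially flushed */
        end += rec_len[i];
    }
    return end;
}

/*
 * Return 1 if `flushed` falls inside a record rather than on a boundary.
 * Assumes `flushed` does not exceed the total length of the listed records.
 */
int flushed_splits_record(const uint32_t *rec_len, size_t nrecs, WalPos flushed)
{
    return last_record_boundary(rec_len, nrecs, flushed) != flushed;
}
```

In this model, a sender that ships everything up to `flushed` can hand the receiver the first part of a record whose remainder was never written on the primary. Shipping only up to last_record_boundary() would avoid that in the toy model; whether that matches the actual walsender problem is exactly what remains open in the thread.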