Hi,

On 2021-08-23 18:52:17 -0400, Alvaro Herrera wrote:
> Included 蔡梦娟 and Jakub Wartak because they've expressed interest on
> this topic -- notably [2] ("Bug on update timing of walrcv->flushedUpto
> variable").
>
> As mentioned in the course of thread [1], we're missing a fix for
> streaming replication to avoid sending records that the primary hasn't
> fully flushed yet.  This patch is a first attempt at fixing that problem
> by retreating the LSN reported as FlushPtr whenever a segment is
> registered, based on the understanding that if no registration exists
> then the LogwrtResult.Flush pointer can be taken at face value; but if a
> registration exists, then we have to stream only till the start LSN of
> that registered entry.

I'm doubtful that the approach of adding awareness of record boundaries is a
good path to go down:

- It adds nontrivial work to hot code paths to handle an edge case, rather
  than making rare code paths more expensive.

- There are very similar issues with promotions of replicas (consider what
  happens if we need to promote with the end of local WAL spanning a segment
  boundary, and what happens to cascading replicas). We have some logic to
  try to deal with that, but it's pretty grotty and I think incomplete.

- It seems to make some future optimizations harder - we should work towards
  replicating data sooner, rather than the opposite. Right now that's a major
  bottleneck around syncrep.

- Once XLogFlush() for some LSN has returned we can write that LSN to disk.
  The LSN doesn't necessarily have to correspond to a specific on-disk
  location (it could e.g. be the return value from GetFlushRecPtr()). But
  "rewinding" to before the last record makes that problematic.

- I suspect that schemes with heuristic knowledge of segment-boundary-spanning
  records have deadlock or at least latency spike issues. What if synchronous
  commit needs to flush up to a certain record boundary, but streaming rep
  doesn't replicate it out because there are segment-spanning records both
  before and after?

I think a better approach might be to handle this on the WAL layout
level. What if we never overwrite partial records but instead just skip over
them during decoding?

Of course there are some difficulties with that - the checksum and the length
from the record header aren't going to be meaningful. But we could deal with
that using a special flag in the XLogPageHeaderData.xlp_info of the following
page. If that flag is set, xlp_rem_len could contain the checksum of the
partial record.

Greetings,

Andres Freund
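
To make the skip-over idea a bit more concrete, here is a rough sketch of the
decision a decoder might make at a page boundary. It is only an illustration
under assumptions: XLP_SKIP_PARTIAL_RECORD and its value, SketchPageHeader,
DecodeAction, sketch_checksum() and decide_at_page_boundary() are all
hypothetical names that do not exist in PostgreSQL, the struct is a cut-down
stand-in for XLogPageHeaderData, and the toy checksum stands in for the real
record CRC. Only XLP_FIRST_IS_CONTRECORD is an existing flag.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Existing xlp_info flag: the page starts with the rest of a record. */
#define XLP_FIRST_IS_CONTRECORD 0x0001
/* HYPOTHETICAL flag (name and value invented for this sketch): the record
 * left unfinished on the previous page was abandoned and must be skipped. */
#define XLP_SKIP_PARTIAL_RECORD 0x0100

/* Simplified stand-in for XLogPageHeaderData; only the fields used here. */
typedef struct SketchPageHeader
{
    uint16_t xlp_info;
    uint32_t xlp_rem_len;   /* normally: remaining bytes of a continued record;
                             * with the skip flag: checksum of the partial bytes */
} SketchPageHeader;

typedef enum
{
    DECODE_CONTINUE,        /* keep assembling the record from the next page */
    DECODE_SKIP_PARTIAL,    /* drop the partial record, resume after the header */
    DECODE_INVALID          /* treat as end of valid WAL */
} DecodeAction;

/* Toy checksum, an assumption standing in for the real record CRC. */
static uint32_t
sketch_checksum(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;

    for (size_t i = 0; i < len; i++)
        sum = sum * 31u + buf[i];
    return sum;
}

/*
 * Decide what to do with a record whose first partial_len bytes were read
 * from the previous page and which still expects expected_rem_len bytes.
 */
static DecodeAction
decide_at_page_boundary(const SketchPageHeader *next,
                        const uint8_t *partial, size_t partial_len,
                        uint32_t expected_rem_len)
{
    if (next->xlp_info & XLP_SKIP_PARTIAL_RECORD)
    {
        /* The record header's own CRC and length are meaningless for an
         * abandoned record, so validate the partial bytes against the
         * checksum stored in xlp_rem_len instead. */
        if (next->xlp_rem_len == sketch_checksum(partial, partial_len))
            return DECODE_SKIP_PARTIAL;
        return DECODE_INVALID;
    }

    /* Simplified version of the ordinary continuation check. */
    if ((next->xlp_info & XLP_FIRST_IS_CONTRECORD) &&
        next->xlp_rem_len == expected_rem_len)
        return DECODE_CONTINUE;

    return DECODE_INVALID;
}
```

The point of the sketch is that the extra check sits only on the rare
page-boundary path: a reader that sees the flag and a matching checksum drops
the partial record and carries on, rather than the sender having to hold back
already flushed WAL.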