On Tue, 21 Jan 2020 at 11:06, Michael Paquier <mich...@paquier.xyz> wrote:
> > The replication interface should not immediately flush changes to the
> > slot replay position on advance. It should be marked dirty and left to
> > be flushed by the next checkpoint. Doing otherwise potentially
> > introduces a lot of unnecessary fsync()s and may have an unpleasant
> > impact on performance.
>
> Some portions of the advancing code tell a different story. It seems
> to me that the intention behind the first implementation of slot
> advancing was to get things flushed if any advancing was done.

Taking a step back here, I have no concerns with the proposed changes for
pg_replication_slot_advance(). Disregard my comments about safety with the
SQL interface for the purposes of this thread; they apply only to logical
slots and are really unrelated to pg_replication_slot_advance().

Re your comment above: for slot advances in general the flush to disk is
done lazily for performance reasons, but I think you meant
pg_replication_slot_advance() specifically. pg_replication_slot_advance()
doesn't appear to make any promises about immediate durability either way.
It updates the required LSN immediately with
ReplicationSlotsUpdateRequiredLSN(), so it theoretically marks WAL as
removable before the slot's new state is flushed to disk. But I don't
think we'll ever actually remove any WAL segments until checkpoint, at
which point we'll also flush any dirty slots, so it doesn't really matter.

For logical slots the lsn and xmin are both protected by the
effective/actual tracking logic and can't advance until the slot is
flushed. The app might be surprised if the slot goes backwards after a
pg_replication_slot_advance() followed by a server crash, though.

> The check doing that is actually broken from the start, but that's
> another story. Could you check with Petr what was the intention here or
> drag his attention to this thread? He is the original author of the
> feature. So his input would be nice to have.

I'll ask him. He's pretty bogged down at the moment though, and I've done
a lot of work in this area too (see e.g. the catalog_xmin in hot standby
feedback changes).

> > The SQL interface advances the slot restart position and marks the
> > slot dirty *before the client has confirmed receipt of the data and
> > flushed it to disk*. So there's a data loss window. If the client
> > disconnects or crashes before all the data from that function call is
> > safely flushed to disk it may lose the data, then be unable to fetch
> > it again from the server because of the restart_lsn position advance.
>
> Well, you have the same class of problems with, for example, synchronous
> replication. The only state a client can be sure of is that it
> received a confirmation that the operation happened and completed.
> Any other state tells that the operation may have happened. Or not.
> Now, being sure that the data of the new slot has been flushed once
> the advancing function is done, once the client got the confirmation
> that the work is done, is a property which could be interesting to some
> class of applications.

That's what we already provide with the streaming interface for slot
access. I agree there's no need to shove a fix to the SQL interface for
physical/logical slots into this same discussion. I'm just trying to make
sure we don't "fix" a "bug" that's actually an important part of the
design by trying to fix a perceived-missing flush in the streaming
interface too.
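Coming back to the restart_lsn-going-backwards point above, here's a rough
psql sketch of the sequence I have in mind. The slot name 'physical_slot'
is made up, and the exact LSNs will of course vary:

    -- Create a physical slot, reserving WAL immediately.
    SELECT pg_create_physical_replication_slot('physical_slot', true);

    -- Advance it. The in-memory restart_lsn (and the required-LSN
    -- computation via ReplicationSlotsUpdateRequiredLSN()) move right
    -- away, but the slot is only marked dirty; the on-disk state is
    -- flushed by the next checkpoint.
    SELECT pg_replication_slot_advance('physical_slot',
                                       pg_current_wal_lsn());

    SELECT slot_name, restart_lsn
      FROM pg_replication_slots
     WHERE slot_name = 'physical_slot';

    -- After a crash before any checkpoint runs, that restart_lsn can
    -- come back at its previously flushed value, i.e. it may appear to
    -- have moved backwards from the client's point of view.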
I am not at all confident that the test coverage for this is sufficient
right now, since we lack a good way to make postgres delay various lazy
internal activity to let us reliably examine intermediate states in a
race-free way, so I'm not sure tests would catch it.

> > Really, we should add a "no_advance_position" option to the SQL
> > interface, then expect the client to call a second function that
> > explicitly advances the restart_lsn and confirmed_flush_lsn. Otherwise
> > no SQL interface client can be crash-safe.
>
> Hm. Could you elaborate more on what you mean here? I am not sure I
> understand. Note that calling the advance function multiple times has
> no effect, and the result returned is the actual restart_lsn (or
> confirmed_flush_lsn of course).

I've probably confused things a bit here. I don't mind whether or not
pg_replication_slot_advance() forces an immediate flush; I think there are
reasonable arguments in both directions.

In the above I was talking about how pg_logical_slot_get_changes()
presently advances the slot position immediately, so if the client loses
its connection before reading and flushing all the data it may be unable
to recover. And while pg_logical_slot_peek_changes() lets the app read the
data without advancing the slot, it then has to do a separate
pg_replication_slot_advance(), which has to do the decoding work again.
I'd like to improve that, but I didn't intend to confuse or sidetrack this
discussion in the process. Sorry.

We don't have a SQL-level interface for reading physical WAL, so there are
no corresponding concerns there.
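For anyone skimming the thread, this is roughly the pattern I mean. It's a
sketch only: it assumes a logical slot named 'logical_slot' (created with
some output plugin such as test_decoding), and '0/1634D78' stands in for
whatever LSN the client has durably stored up to:

    -- Risky: returns the changes and advances the slot in one step, so
    -- a client crash after this call but before the client's own flush
    -- loses data.
    SELECT * FROM pg_logical_slot_get_changes('logical_slot', NULL, NULL);

    -- Crash-safe pattern: peek without advancing, durably store the
    -- results client-side, then confirm explicitly ...
    SELECT * FROM pg_logical_slot_peek_changes('logical_slot', NULL, NULL);

    -- ... but the explicit advance has to repeat the decoding work up
    -- to the confirmed position.
    SELECT pg_replication_slot_advance('logical_slot', '0/1634D78');

That second round of decoding is the cost I'd like to avoid eventually,
but as I said above, that's a topic for another thread.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise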