Hello,

Here is a rebased patch, and separate replies to Michael and Michail.
On Sat, Dec 1, 2018 at 4:57 PM Michael Paquier <mich...@paquier.xyz> wrote:
> On Sat, Dec 01, 2018 at 02:48:29PM +1300, Thomas Munro wrote:
> > Right, it conflicted with 4c703369 and cfdf4dc4. While rebasing on
> > top of those, I found myself wondering why syncrep.c thinks it needs
> > special treatment for postmaster death. I don't see any reason why we
> > shouldn't just use WL_EXIT_ON_PM_DEATH, so I've done it like that in
> > this new version. If you kill -9 the postmaster, I don't see any
> > reason to think that the existing coding is more correct than simply
> > exiting immediately.
>
> Hm. This stuff runs under many assumptions, so I think that we should
> be careful here with any changes as the very recent history has proved
> (4c70336). If we were to switch WAL senders on postmaster death, I
> think that this could be a change independent of what is proposed here.

Fair point. I think the effect should be the same with less code: either
way you see the server hang up without sending a COMMIT tag, but maybe
I'm missing something. Change reverted; let's discuss that another time.

On Mon, Dec 3, 2018 at 9:01 AM Michail Nikolaev
<michail.nikol...@gmail.com> wrote:
> It is really nice feature. I am working on the project which heavily reads
> from replicas (6 of them).

Thanks for your feedback.

> In our case we have implemented some kind of "replication barrier"
> functionality based on table with counters (one counter per application
> backend in simple case).
> Each application backend have dedicated connection to each replica. And it
> selects its counter value few times (2-100) per second from each replica in
> background process (depending on how often replication barrier is used).

Interesting approach. Why don't you sample pg_last_wal_replay_lsn() on all
the standbys instead, so you don't have to generate extra write traffic?
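For instance, roughly like this (just a sketch; the LSN value shown is
made up):

    -- on the primary, immediately after the write transaction commits:
    SELECT pg_current_wal_lsn();          -- say it returns '0/1623D28'

    -- on each standby, poll until replay has reached that point:
    SELECT pg_last_wal_replay_lsn() >= '0/1623D28'::pg_lsn;

Once that returns true on a given standby, your commit has been replayed
there and it's safe to read the new data from it, with no extra writes on
the primary.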
> Once application have committed transaction it may want join replication
> barrier before return new data to a user. So, it increments counter in the
> table and waits until all replicas have replayed that value according to
> background monitoring process. Of course timeout, replicas health checks and
> few optimizations and circuit breakers are used.

I'm interested in how you handle failure (taking too long to respond or to
see the new counter value, connectivity failure, etc). Specifically, if the
writer decides to give up on a certain standby (timeout, circuit breaker,
etc), how should a client that is connected directly to that standby, now
or soon afterwards, know that the standby has been 'dropped' from the
replication barrier and is now at risk of serving stale data?

My patch handles this by cancelling standbys' leases explicitly and waiting
for a response (if possible), or otherwise by waiting for the leases to
expire (say if connectivity is lost or the standby has gone crazy or
stopped responding), so that there is no scenario in which someone can
successfully execute queries on a standby that hasn't applied a transaction
you know to be committed on the primary.

> Nice thing here - constant number of connection involved. Even if lot of
> threads joining replication barrier in the moment. Even if some replicas are
> lagging.
>
> Because 2-5 seconds lag of some replica will lead to out of connections issue
> in few milliseconds in case of implementation described in this thread.

Right, if a standby is lagging by more than the allowed amount, in my patch
its lease is cancelled and it refuses to handle requests while the GUC is
on, raising a special new error code; it's then up to the client to decide
what to do, probably by finding another node.

> It may be the weak part of the patch I think. At least for our case.

Could you please elaborate? What could you do that would be better? If the
answer is that you just want to know that you might be seeing stale data
but for some reason don't want to have to find a new node, the reader is
welcome to turn synchronous_replay off and try again (giving up the data
freshness guarantee). I'm not sure when that would be useful, though.

> But it possible could be used to eliminate odd table with counters in my case
> (if it possible to change setting per transaction).

Yes, the behaviour can be activated per transaction, using the usual GUC
scoping rules. The setting synchronous_replay must be on in both the write
transaction and the following read transaction for the logic to work (ie
for the writer to wait, and for the reader to make sure that it has a valid
lease or raise an error); see the sketch below.

It sounds like my synchronous_replay GUC is quite similar to your
replication barrier system, except that it has a way to handle node failure
and excessive lag without abandoning the guarantee.
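To make the per-transaction usage concrete, it looks roughly like this
(just a sketch; my_table is a stand-in for whatever you read and write):

    -- on the primary:
    BEGIN;
    SET LOCAL synchronous_replay = on;
    UPDATE my_table SET counter = counter + 1 WHERE id = 1;
    COMMIT;   -- waits for standbys holding leases to apply the commit

    -- on a standby:
    BEGIN;
    SET LOCAL synchronous_replay = on;
    SELECT counter FROM my_table WHERE id = 1;   -- raises an error rather
                                                 -- than risking a stale read
                                                 -- if this standby's lease
                                                 -- isn't currently valid
    COMMIT;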
I've attached a small shell script that starts up a primary and N replicas
with synchronous_replay configured, in the hope of encouraging you to try
it out.

--
Thomas Munro
http://www.enterprisedb.com

0001-Synchronous-replay-mode-for-avoiding-stale-reads-v10.patch
Description: Binary data

test-synchronous-replay.sh
Description: Bourne shell script