Hi hackers,

Here is a new version of my "causal reads" patch (see the earlier thread from the 9.6 development cycle[1]), which provides a way to avoid stale reads when load balancing with streaming replication.
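To make "stale reads" concrete before going further, here's the kind of anomaly I mean (host names and table invented for illustration; this is just what ordinary load-balanced streaming replication already allows, nothing specific to the patch):

    -- tx1, run by a client on the primary:
    CREATE TABLE accounts (id int PRIMARY KEY, balance int);
    BEGIN;
    INSERT INTO accounts (id, balance) VALUES (42, 100);
    COMMIT;

    -- tx2, run moments later on a load-balanced standby by the same client
    -- (or by another client that has heard about tx1):
    SELECT balance FROM accounts WHERE id = 42;
    -- may return no row, because the standby hasn't replayed tx1 yet:
    -- a stale read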
To try it out: set up a primary and some standbys, and put "causal_reads_timeout = 4s" in the primary's postgresql.conf. Then SET causal_reads = on and experiment with various workloads, watching the primary's log and looking at pg_stat_replication. For example you could try out test-causal-reads.c with --causal-reads --check (from the earlier thread), or write something similar, and verify the behaviour while killing, pausing, overwhelming servers etc.

Here's a brief restatement of the problem I'm trying to address and how this patch works:

In 9.6 we got a new synchronous_commit level, "remote_apply", which causes committing transactions to block until the commit record has been applied on the current synchronous standby server; in 10devel that can now be more than one server. That's useful because it means that a client can run tx1 on the primary and then run tx2 on an appropriate standby, or cause some other client to do so, and be sure that tx2 can see tx1. Tx2 can be said to be "causally dependent" on tx1: clients expect tx2 to see tx1 because they know that tx1 happened before tx2.

In practice there are complications relating to failure and transitions. How should you find an appropriate standby? Suppose you have a primary and N standbys, you set synchronous_standby_names to wait for all N standbys, and you set synchronous_commit to remote_apply. Then the above guarantee of visibility of tx1 by tx2 holds, no matter which server you run tx2 on. Unfortunately, if one of your standby servers fails or there is a network partition, all commits will block until you fix that. So you probably want to set synchronous_standby_names to wait for only a subset of your standbys. Now you can lose some number of standby servers without holding up commits on the primary, but the visibility guarantee for causal dependencies is lost! How can a client know for certain whether tx2 run on any given standby can see a transaction tx1 that it has heard about? If you're using the new "ANY n" mode, then the subset of standbys that have definitely applied tx1 is not known to any client; if you're using the traditional FIRST mode, it's complicated during transitions (you might be talking to a standby that has recently lost its link to the primary, while the primary has decided to wait for the next-highest-priority standby instead and has already returned from COMMIT successfully).

This patch provides the following guarantee: if causal_reads is on for both tx1 and tx2, then after tx1 returns, tx2 will either see tx1 or fail with an error indicating that the server is currently unavailable for causal reads. This guarantee is upheld even if there is a network partition and the standby running tx2 is unable to communicate with the primary server, but it requires the system clocks of all standbys to differ from the primary's by less than a certain amount of allowable skew that is accounted for in the algorithm (causal_reads_timeout / 4; see README.causal_reads for the gory details). It works by sending a stream of "leases" to standbys that are applying fast enough. These leases grant the standby the right to assume, for a limited time and without doing any communication or waiting, that all transactions that were run with causal_reads = on and have returned control have already been applied locally. Leases are promises made by the primary that it will wait for all such transactions to be applied on each 'available' standby, or for available standbys' leases to be revoked because they're lagging too much, and for any revoked leases to expire.
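Concretely, with "causal_reads_timeout = 4s" in the primary's postgresql.conf as above, the two transactions might look roughly like this (reusing the hypothetical accounts table from earlier; the exact error text doesn't matter for the argument):

    -- tx1, on the primary:
    SET causal_reads = on;
    BEGIN;
    INSERT INTO accounts (id, balance) VALUES (42, 100);
    COMMIT;  -- returns only once the primary has waited as described above:
             -- every 'available' standby has applied it, or lagging standbys'
             -- leases have been revoked and have expired

    -- tx2, on any standby, started after tx1 has returned:
    SET causal_reads = on;
    SELECT balance FROM accounts WHERE id = 42;
    -- either sees tx1's row, or fails with an error indicating that this
    -- standby is currently unavailable for causal reads (it holds no lease),
    -- so the client can retry on another node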
As discussed in the earlier thread, there are other ways that tx2 run on a standby could get a useful guarantee about the visibility of an earlier transaction tx1 that the client knows about.

(1) User-managed "causality tokens": clients could somehow obtain the LSN of tx1's commit (or later), and then tx2 could explicitly wait for that LSN to be applied, as proposed by Ivan Kartyshov[2] and others. If you aren't also using sync rep for data loss avoidance, then tx1 will return from committing without waiting for standbys, and by the time tx2 starts on a standby it may find that the LSN has already been applied and not have to wait at all. That is definitely good. Unfortunately it also transfers the problem of tracking causal dependencies between transactions to client code, which is a burden on the application developer and difficult to retrofit.

(2) Middleware-managed causality tokens: something like pgpool or pgbouncer or some other proxy could sit in front of all of your PostgreSQL servers, watch all transactions and do the LSN tracking for you, inserting waits where appropriate so that no standby query ever sees a snapshot that doesn't include any commit that any client has heard about. But that requires tx2 to wait for transactions that may be later than tx1 to be applied, potentially slowing down every read query, and it requires pushing all transactions through a single central process, thereby introducing its own failover problem with an associated transition failure mode that could break our guarantee if somehow two of these proxies are ever active at once.

Don't get me wrong, I think those are good ideas: let's do those too. I guess that people working on logical multi-master replication might eventually want a general concept of causality tokens which could include some kind of vector clock. But I don't see this proposal as conflicting with any of that. It's a set of trade-offs that provides a simple solution for users who want to be able to talk directly to any PostgreSQL standby server out of the box, without pushing everything through a central observer, and who want to be able to enable this for existing applications without having to rewrite them to insert complicated code to track and communicate LSNs.

Some assorted thoughts and things I'd love to hear your ideas on:

I admit that it has a potentially confusing relationship with synchronous replication. It is presented as a separate feature, and you can use both features together or use them independently: synchronous_standby_names and synchronous_commit are for controlling your data loss risk, while causal_reads_standby_names and causal_reads are for controlling distributed read consistency.
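To illustrate that separation, a sketch of how the two sets of GUCs might be combined (values invented; I'm assuming here that causal_reads_standby_names takes a name list with '*' meaning all standbys, analogous to synchronous_standby_names, and ALTER SYSTEM stands in for editing postgresql.conf):

    -- on the primary; s1/s2/s3 are made-up standby names
    ALTER SYSTEM SET synchronous_standby_names = 'ANY 1 (s1, s2, s3)';  -- data loss risk
    ALTER SYSTEM SET synchronous_commit = 'remote_apply';
    ALTER SYSTEM SET causal_reads_timeout = '4s';                       -- read consistency
    ALTER SYSTEM SET causal_reads_standby_names = '*';                  -- assumed syntax
    SELECT pg_reload_conf();

    -- then any session that wants the read consistency guarantee,
    -- whether it is connected to the primary or to a standby:
    SET causal_reads = on;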
Perhaps the causal_reads GUC should support different levels rather than just on/off; the mode described above could be enabled with something = 'causal_read_lease', leaving room for other modes.

Maybe the whole feature needs a better name: I borrowed "causal reads" from Galera's wsrep_causal_reads/wsrep_sync_wait. That system makes readers (think standbys) wait for the global end of WAL to be applied locally at the start of every transaction, which could also be a potential future mode for us, but I thought it was much more interesting to have wait-free reads on standbys, especially if you already happen to be waiting on the primary because you want to avoid data loss with syncrep. To achieve that I added system-clock-based leases. I suspect some people will dislike that part: the guarantee includes the caveat about the maximum difference between system clocks, and the patch doesn't do anything as clever as Google's Spanner/TrueTime system or come with a free built-in atomic clock, so it relies on setting the max clock skew conservatively and making sure you have NTP set up correctly (for example, things reportedly got a bit messy for a short time after the recent leap second if you happened to have only one server from pool.ntp.org in your ntpd.conf and were unlucky).

I considered ways to make causal reads an extension, but it'd need fairly invasive hooks, including the ability to change replication wire protocol messages.

Long term, I think it would be pretty cool if we could develop a set of features that give you distributed sequential consistency on top of streaming replication. Something like (this | causality-tokens) + SERIALIZABLE-DEFERRABLE-on-standbys[3] + distributed-dirty-read-prevention[4].

The patch: the replay lag tracking patch this depends on is in the current commitfest[1] and is presented as an independent useful feature. Please find the two patches that implement causal reads for the open CF attached. First apply replay-lag-v16.patch, then refactor-syncrep-exit-v16.patch, then causal-reads-v16.patch.

Thanks for reading!

[1] https://www.postgresql.org/message-id/CAEepm=0n_OxB2_pNntXND6aD85v5PvADeUY8eZjv9CBLk=z...@mail.gmail.com
[2] https://www.postgresql.org/message-id/flat/0240c26c-9f84-30ea-fca9-93ab2df5f...@postgrespro.ru#0240c26c-9f84-30ea-fca9-93ab2df5f...@postgrespro.ru
[3] https://www.postgresql.org/message-id/flat/CAEepm%3D2b9TV%2BvJ4UeSBixDrW7VUiTjxPwWq8K3QwFSWx0pTXHQ%40mail.gmail.com
[4] https://www.postgresql.org/message-id/flat/CAEepm%3D1GNCriNvWhPkWCqrsbXWGtWEEpvA-KnovMbht5ryzbmg%40mail.gmail.com

--
Thomas Munro
http://www.enterprisedb.com
Attachments: refactor-syncrep-exit-v16.patch, causal-reads-v16.patch