Hi,

Here is a new version of the patch with a few small improvements:

1. Adopted the term '[read] lease', replacing various hand-wavy language in the comments and code. That seems to be the established term for this approach[1].

2. Reduced the stalling time on failure. When things go wrong with a standby (such as losing contact with it), instead of stalling for a conservative amount of time longer than any lease that might have been granted, the primary now stalls only until the expiry of the last lease that actually was granted to a given dropped standby, which should be sooner.

3. Fixed a couple of bugs that showed up in testing and review (some bad flow control in the signal handling, and a bug in a circular buffer), and changed the recovery->walreceiver wakeup signal handling to block the signal except while waiting in walrcv_receive (it didn't seem a good idea to interrupt arbitrary syscalls in walreceiver so I thought that would be an improvement; but of course that area's going to be reworked by Simon's patch anyway, as discussed elsewhere).

Restating the central idea using the new terminology:

So long as they are replaying fast enough, the primary grants a series of causal reads leases to standbys, allowing them to handle causal reads queries locally without any inter-node communication for a limited time. Leases are promises that, before the primary returns from commit, it will wait for the standby either to apply the commit record OR to be dropped from the set of available causal reads standbys and know that it has been dropped, in order to uphold the causal reads guarantee. In the worst case it can do that by waiting for the most recently granted lease to expire.

I've also attached a couple of things which might be useful when trying the patch out: test-causal-reads.c, which can be used to test performance and causality under various conditions, and test-causal-reads.sh, which can be used to bring up a primary and a bunch of local hot standbys to talk to. (In the hope of encouraging people to take the patch for a spin...)
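To make the lease rule restated above a bit more concrete, here is a minimal sketch of the primary-side decision. It is not code from the patch: StandbyLease, grant_lease, drop_standby and commit_may_return are hypothetical names invented for illustration, times are plain millisecond counters, and it models only the worst case (waiting out the last granted lease), ignoring clock skew and the faster path where a dropped standby learns of the drop before its lease expires.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical primary-side view of one standby (not the patch's structs). */
typedef struct
{
    bool     available;        /* still in the causal reads set? */
    uint64_t applied_lsn;      /* last WAL position reported as applied */
    int64_t  lease_expiry_ms;  /* expiry of the last lease actually granted */
} StandbyLease;

/* Renew the lease, but only for a standby that is still available. */
void
grant_lease(StandbyLease *s, int64_t now_ms, int64_t lease_ms)
{
    assert(s != NULL);
    if (s->available)
        s->lease_expiry_ms = now_ms + lease_ms;
}

/* Drop a standby: no new leases, but the last granted one must run out. */
void
drop_standby(StandbyLease *s)
{
    assert(s != NULL);
    s->available = false;
}

/*
 * May a commit at commit_lsn return at time now_ms?  Every available
 * standby must have applied it, and the last lease granted to every
 * dropped standby must have expired.
 */
bool
commit_may_return(const StandbyLease *standbys, size_t n,
                  uint64_t commit_lsn, int64_t now_ms)
{
    size_t i;

    for (i = 0; i < n; ++i)
    {
        const StandbyLease *s = &standbys[i];

        if (s->available)
        {
            if (s->applied_lsn < commit_lsn)
                return false;   /* still waiting for apply */
        }
        else if (now_ms < s->lease_expiry_ms)
            return false;       /* dropped, but its lease is still live */
    }
    return true;
}
```

With this model, a standby whose last lease was granted at t=1200 for 500 ms and which is then dropped holds up commits only until t=1700, rather than for some conservative upper bound on any lease that might have been granted.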
[1] Originally from a well-known 1989 paper on caching, but in the context of databases and synchronous replication see for example the recent papers on "Niobe" and "Paxos Quorum Leases" (especially the reference to Google Megastore). Of course a *lot* more is going on in those very different algorithms, but at some level "read leases" are being used to allow local-node-only reads for a limited time while upholding some kind of global consistency guarantee, in some of those consensus database systems. I spent a bit of time talking about consistency levels to database guru and former colleague Alex Scotti who works on a Paxos-based system, and he gave me the initial idea to try out a lease-based consistency system for Postgres streaming rep. It seems like a very useful point in the space of trade-offs to me.

--
Thomas Munro
http://www.enterprisedb.com
test-causal-reads.sh
Description: Bourne shell script
/*
 * A simple test program to test performance and visibility with the causal
 * reads patch.
 *
 * Each test loop updates a row on the primary, and then optionally checks if
 * it can see that change immediately on a standby.  If you do this with
 * standard async replication, you should occasionally see an assertion fail
 * if run with --check (depending on the vagaries of timing -- I can
 * reproduce this very reliably on my system).  If you do it with traditional
 * sync rep, it becomes a little bit less likely (but it's still reliably
 * reproducible on my system).  If you do it with traditional sync rep set up,
 * and "--synchronous-commit apply" then it should no longer be possible to
 * trigger that assertion, but that's just a straw-man mode.  If you do it
 * with --causal-reads then you should not be able to reproduce it, no matter
 * which standby you connect to.  If you're using --check and the standby gets
 * dropped (perhaps because you break/disconnect/pause it etc) you should
 * never see that assertion fail (= SELECT running but seeing stale data),
 * instead you should see an error when running the SELECT.
 *
 * Arguments:
 *
 *  --primary <connection-string>    how to connect to the primary
 *  --standby <connection-string>    how to connect to the standby to check
 *  --check                          check that the update is visible on standby
 *  --causal-reads                   enable causal reads
 *  --synchronous-commit LEVEL       set synchronous_commit to LEVEL
 *  --loops COUNT                    how many loops to run through
 *  --verbose                        chatter
 */

#include <libpq-fe.h>

#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int
main(int argc, char *argv[])
{
    PGconn *primary;
    PGconn *standby;
    PGresult *result;
    int i;
    int loops = 10000;
    char buffer[1024];
    const char *synchronous_commit = "on";
    bool causal_reads = false;
    const char *primary_connstr = "dbname=postgres port=5432";
    const char *standby_connstr = "dbname=postgres port=5442";
    bool check_applied = false;
    bool verbose = false;

    /* Parse command line; options that take a value need a following arg. */
    for (i = 1; i != argc; ++i)
    {
        bool more = (i < argc - 1);

        if (strcmp(argv[i], "--verbose") == 0)
            verbose = true;
        else if (strcmp(argv[i], "--check") == 0)
            check_applied = true;
        else if (strcmp(argv[i], "--synchronous-commit") == 0 && more)
            synchronous_commit = argv[++i];
        else if (strcmp(argv[i], "--causal-reads") == 0)
            causal_reads = true;
        else if (strcmp(argv[i], "--primary") == 0 && more)
            primary_connstr = argv[++i];
        else if (strcmp(argv[i], "--standby") == 0 && more)
            standby_connstr = argv[++i];
        else if (strcmp(argv[i], "--loops") == 0 && more)
            loops = atoi(argv[++i]);
        else
        {
            fprintf(stderr, "bad argument\n");
            exit(1);
        }
    }

    primary = PQconnectdb(primary_connstr);
    assert(PQstatus(primary) == CONNECTION_OK);

    standby = PQconnectdb(standby_connstr);
    assert(PQstatus(standby) == CONNECTION_OK);

    /* Apply the requested settings on the primary session. */
    snprintf(buffer, sizeof(buffer), "SET synchronous_commit = %s", synchronous_commit);
    result = PQexec(primary, buffer);
    assert(PQresultStatus(result) == PGRES_COMMAND_OK);
    PQclear(result);

    snprintf(buffer, sizeof(buffer), "SET causal_reads = %s", causal_reads ? "on" : "off");
    result = PQexec(primary, buffer);
    assert(PQresultStatus(result) == PGRES_COMMAND_OK);
    PQclear(result);

    /* Apply the same settings on the standby session. */
    snprintf(buffer, sizeof(buffer), "SET synchronous_commit = %s", synchronous_commit);
    result = PQexec(standby, buffer);
    assert(PQresultStatus(result) == PGRES_COMMAND_OK);
    PQclear(result);

    snprintf(buffer, sizeof(buffer), "SET causal_reads = %s", causal_reads ? "on" : "off");
    result = PQexec(standby, buffer);
    assert(PQresultStatus(result) == PGRES_COMMAND_OK);
    PQclear(result);

    /* Create the test table, tolerating "duplicate table" (42P07). */
    result = PQexec(primary, "CREATE TABLE counter AS SELECT 0 AS n");
    assert(PQresultStatus(result) == PGRES_COMMAND_OK ||
           strcmp(PQresultErrorField(result, PG_DIAG_SQLSTATE), "42P07") == 0);
    PQclear(result);

    for (i = 0; i < loops; ++i)
    {
        if (verbose)
            printf("Updating primary...\n");
        snprintf(buffer, sizeof(buffer), "UPDATE counter SET n = %d", i);
        result = PQexec(primary, buffer);
        assert(PQresultStatus(result) == PGRES_COMMAND_OK);
        PQclear(result);

        if (check_applied)
        {
            /* The standby must already see the value just committed. */
            if (verbose)
                printf("Checking standby...\n");
            snprintf(buffer, sizeof(buffer), "SELECT n FROM counter");
            result = PQexec(standby, buffer);
            assert(PQresultStatus(result) == PGRES_TUPLES_OK);
            assert(PQntuples(result) == 1);
            assert(atoi(PQgetvalue(result, 0, 0)) == i);
            PQclear(result);
        }
    }

    exit(0);
}
causal-reads-v3.patch
Description: Binary data