On Thu, Sep 1, 2016 at 2:16 AM, Ivan Kartyshov
<i.kartys...@postgrespro.ru> wrote:
> Hi hackers,
>
> Few days earlier I've finished my work on WAITLSN statement utility, so I'd
> like to share it.
> [...]
> Your feedback is welcome!
>
> [waitlsn_10dev.patch]
Hi Ivan,

Thanks for working on this.  Here are some general thoughts on the feature, and an initial review.

+1 for this feature.  Explicitly waiting for a given commit to be applied is one of several approaches to achieving "causal consistency" for reads on replica nodes, and I think it will be very useful if combined with a convenient way to get the values to wait for when you run COMMIT.  This could be used either by applications directly, or by middleware that somehow keeps track of dependencies between transactions and inserts waits.

I liked the way Heikki Linnakangas imagined this feature[1]:

    BEGIN WAIT FOR COMMIT 1234 TO BE VISIBLE;

... or perhaps it could be spelled like this:

    BEGIN [isolation stuff] WAIT FOR COMMIT TOKEN <xxx> TIMEOUT <timeout>;

That allows waiting only at the start of a transaction, whereas your idea of making a utility command would allow a single READ COMMITTED transaction to wait multiple times for transactions it has heard about through side channels, which may be useful.  Perhaps we could support the same syntax both as a standalone statement inside a transaction and as part of a BEGIN ... statement.  Being able to do it as part of BEGIN means that you can use this feature for single-snapshot transactions, i.e. REPEATABLE READ and SERIALIZABLE (of course you can't use SERIALIZABLE on hot standbys yet, but that'll be fixed one day).  Otherwise you'd be waiting for the LSN in the middle of your transaction but unable to see the result, because you don't take a new snapshot.  Or maybe it's enough to allow a standalone WAIT ... statement inside a REPEATABLE READ or SERIALIZABLE transaction as long as it's the first statement, and make it an error to do so any time later?

I think working in terms of LSNs or XIDs explicitly is not a good idea: encouraging clients to think in terms of anything other than opaque "commit tokens" seems like a bad idea, because it limits future changes.  For example, there is ongoing discussion about introducing CSNs (commit sequence numbers), and there are some related concepts lurking in the SSI code; maybe we'd want to use those one day.  Do you think it would make sense to have a concept of a commit token that is a non-analysable string as far as clients are concerned, so that clients are not encouraged to do anything at all with them except use them in a WAIT FOR COMMIT TOKEN <xxx> statement?

INITIAL FEEDBACK ON THE PATCH

I didn't get as far as testing or detailed review because the patch has some obvious bugs and compiler warnings which I figured we should talk about first, and I also have some higher level questions about the design.

  gram.y:12882:15: error: assignment makes pointer from integer
  without a cast [-Werror=int-conversion]
     n->delay = $3;

It looks like struct WaitLSNStmt accidentally declares 'delay' as a pointer to int.  Perhaps you want a plain int?  Maybe it would also be useful to include the units (milliseconds, ms) in the name.

  waitlsn.c: In function 'WLDisownLatch':
  waitlsn.c:82:2: error: suggest parentheses around assignment used as
  truth value [-Werror=parentheses]
    if (MyBackendId = state->backend_maxid)
        ^~

Pretty sure you want == here.

  waitlsn.c: In function 'WaitLSNUtility':
  waitlsn.c:153:17: error: initialization makes integer from pointer
  without a cast [-Werror=int-conversion]
    int tdelay = delay;
                 ^~~~~

Another place where I think you wanted an int but used a pointer to int.  To silence that warning you'd need tdelay = *delay, but I think delay should really not be passed by pointer at all.
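Just to illustrate, here's the shape I have in mind (a sketch only: I'm guessing at the rest of your node's fields, and the _ms suffix is merely a suggestion):

    /* Sketch: store the timeout by value, with the units in the name. */
    typedef struct WaitLSNStmt
    {
        NodeTag     type;
        char       *lsn;        /* guessing at this field's name/type */
        int         delay_ms;   /* timeout in milliseconds */
    } WaitLSNStmt;

... and then WaitLSNUtility would take the timeout as a plain int rather than a pointer.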
@@ -6922,6 +6923,11 @@ StartupXLOG(void)
+            /*
+             * After update lastReplayedEndRecPtr set Latches in SHMEM array
+             */
+            WaitLSNSetLatch();
+

I think you should try to do this only after commit records are replayed, not after every record.  Only commit records can make transactions visible, and the raison d'être for this feature is to let users wait for transactions they know about to become visible.  You probably can't do it directly in xact_redo_commit though, because at that point XLogCtl->lastReplayedEndRecPtr hasn't been updated yet, so a backend that wakes up might not see that it has advanced and would go back to sleep.  It is updated in the StartupXLOG loop after the redo function runs.  That is the reason why WalRcvForceReply() is called from there rather than in xact_redo_commit, to implement remote_apply for replication.  Perhaps you need something similar?

+        tdelay -= (GetCurrentTimestamp() - timer);

You can't do arithmetic with TimestampTz like this.  Depending on the configure option --disable-integer-datetimes (which controls the macro HAVE_INT64_TIMESTAMP), it may be a floating point number of seconds since the epoch, or an integer number of microseconds since the epoch.  It looks like the above code assumes it works in milliseconds, since you're using it to adjust your delay, which is in milliseconds.  I would try to figure out how to express the logic you want with TimestampTzPlusMilliseconds and TimestampDifference.  I'd probably compute the absolute timeout time with TimestampTzPlusMilliseconds(GetCurrentTimestamp(), delay) at the start, and then compute the remaining delay each time through the latch wait loop with TimestampDifference(GetCurrentTimestamp(), timeout, ...), though that requires converting (seconds, microseconds) to milliseconds.  (I've included a sketch of this below.)

+void
+WaitLSNSetLatch(void)
+{
+    uint i;
+    for (i = 1; i < (state->backend_maxid+1); i++)
+    {
+        SpinLockAcquire(&state->l_arr[i].slock);
+        if (state->l_arr[i].pid != 0)
+            SetLatch(&state->l_arr[i].latch);
+        SpinLockRelease(&state->l_arr[i].slock);
+    }
+}

So your approach here is to let regular backends each have their own slot indexed by backend ID.  That seems good because it means they don't have to contend for a lock, but bad because the recovery process has to spin through the array looking for backends to wake up every time it advances, and wake them all up whether or not they are interested in the current LSN.  That means they may be woken many times before they see the value they're waiting for.

Did you also consider a design where there would be a wait list/queue, and the recovery process would wake up only those backends that are waiting for LSNs <= the current replayed location?  That would make the work for the recovery process cheaper (it just has to pop waiters from one end of a list sorted by the LSN they're waiting for), and let the waiting backends sleep without so many spurious wake-ups, but it would create potential for contention between backends calling WAITLSN at the same time, because they would all need to add themselves to that queue under some kind of mutex.  I don't know whether that would be better or not, but it's probably the first way I would have tried to do this.  See syncrep.c, which does something similar.
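A very rough sketch of the wakeup side under that design, modelled loosely on SyncRepWakeQueue (the lock name, queue and PGPROC fields here are all made up, and interrupt/error handling is omitted):

    /*
     * Hypothetical: wake only backends whose target LSN has now been
     * applied, assuming a shared queue kept sorted by target LSN.
     */
    static void
    WaitLSNWakeQueue(XLogRecPtr appliedUpto)
    {
        PGPROC *proc;

        LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
        proc = (PGPROC *) SHMQueueNext(&waitLSNQueue, &waitLSNQueue,
                                       offsetof(PGPROC, waitLSNLinks));
        while (proc != NULL)
        {
            PGPROC *thisproc = proc;

            /* Queue is sorted, so stop at the first waiter not ready. */
            if (thisproc->waitLSNTarget > appliedUpto)
                break;

            proc = (PGPROC *) SHMQueueNext(&waitLSNQueue,
                                           &thisproc->waitLSNLinks,
                                           offsetof(PGPROC, waitLSNLinks));
            SHMQueueDelete(&thisproc->waitLSNLinks);
            SetLatch(&thisproc->procLatch);
        }
        LWLockRelease(WaitLSNLock);
    }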
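And going back to the timeout arithmetic above, the shape I had in mind was something like this (untested sketch: TimestampTzPlusMilliseconds, TimestampDifference, WaitLatch and MyLatch are real, the rest is hand-waved, and this assumes delay_ms > 0, i.e. not the wait-forever case):

    TimestampTz deadline;
    long        secs;
    int         microsecs;
    long        remaining_ms;

    /* Compute an absolute deadline once, before entering the loop. */
    deadline = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), delay_ms);

    for (;;)
    {
        if (TargetLSNReplayed())    /* hypothetical recheck of the LSN */
            break;

        /* TimestampDifference reports zero if the deadline has passed. */
        TimestampDifference(GetCurrentTimestamp(), deadline,
                            &secs, &microsecs);
        remaining_ms = secs * 1000L + microsecs / 1000;
        if (remaining_ms <= 0)
            break;                  /* timed out */

        /* Your patch waits on its own latch; MyLatch shown for brevity. */
        WaitLatch(MyLatch, WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
                  remaining_ms);
        ResetLatch(MyLatch);
    }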
+static void
+WLOwnLatch(void)
+{
+    SpinLockAcquire(&state->l_arr[MyBackendId].slock);
+    OwnLatch(&state->l_arr[MyBackendId].latch);
+    is_latch_owned = true;
+    if (MyBackendId > state->backend_maxid)
+        state->backend_maxid += 1;
+    state->l_arr[MyBackendId].pid = MyProcPid;
+    SpinLockRelease(&state->l_arr[MyBackendId].slock);
+}

I'm a bit confused about state->backend_maxid.  It looks like you are using it to limit the range of slots that WaitLSNSetLatch has to scan.  Isn't it supposed to contain the highest MyBackendId that has ever been seen?  You appear to increment backend_maxid by one, instead of recording the index of the highest slot in use, yet WaitLSNSetLatch uses it to bound the range of indexes it scans.

Then there is the question of synchronisation of access to backend_maxid.  You hold the spinlock of one arbitrary slot, but that doesn't seem sufficient: another backend may also read it, compute a new value and then write it back, all while holding a different slot's spinlock.  Or am I missing something?

+    if (delay > 0)
+        latch_events = WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH;
+    else
+        latch_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;

Here you are using delay <= 0 to mean 'wait forever'.  I wonder if it would be useful to have two different special values: one meaning 'wait forever', and another meaning 'don't wait at all: if it's not applied yet, time out immediately'.  In any case I'd give the special wait times names and use those for clarity: WAITLSN_INFINITE_WAIT, WAITLSN_NO_WAIT.

Later I'll have feedback on the error messages, documentation and comments, but let's talk just about the design and code for now.

I hope this is helpful!

[1] https://www.postgresql.org/message-id/5642FF8F.4080803%40iki.fi

-- 
Thomas Munro
http://www.enterprisedb.com