On 1/25/22 10:18, Peter Eisentraut wrote:
On 15.01.22 23:57, Tomas Vondra wrote:
This approach (and also my previous proposal) seems to assume that
the value returned from nextval() should not be used until the
transaction executing that nextval() has been committed successfully.
But I'm not sure how many applications follow this assumption. Some
application might use the return value of nextval() instantly before
issuing commit command. Some might use the return value of nextval()
executed in rollbacked transaction.
IMO any application that assumes data from uncommitted transactions is
outright broken and we should not try to fix that because it's quite
futile (and likely will affect well-behaving applications).
The issue I'm trying to fix in this thread is much narrower - we don't
actually meet the guarantees for committed transactions (that only did
nextval without generating any WAL).
The wording in the SQL standard is:
"Changes to the current base value of a sequence generator are not
controlled by SQL-transactions; therefore, commits and rollbacks of
SQL-transactions have no effect on the current base value of a sequence
generator."
This implies the well-known behavior that consuming a sequence value is
not rolled back. But it also appears to imply that committing a
transaction has no impact on the validity of a sequence value produced
during that transaction. In other words, this appears to imply that
making use of a sequence value produced in a rolled-back transaction is
valid.
A very strict reading of this would seem to imply that every single
nextval() call needs to be flushed to WAL immediately, which is of
course impractical.
I'm not an expert in reading standards, but I'd not interpret it that
way. I think it simply says the sequence must not go back, no matter
what happened to the transaction.
IMO interpreting this as "must not lose any increments from uncommitted
transactions" is maybe a bit too strict, and as you point out it's also
impractical because it'd mean calling nextval() repeatedly flushes WAL
all the time. Not great for batch loads, for example.
I don't think we need to flush WAL for every nextval() call, if we don't
write WAL for every increment - I think we still can batch WAL for 32
increments just like we do now (AFAICS that'd not contradict even this
quite strict interpretation of the standard).
OTOH the flush would have to happen immediately, we can't delay that
until the end of the transaction. Which is going to affect even cases
that generate WAL for other reasons (e.g. doing insert), which was
entirely unaffected by the previous patches.
And the flush would have to happen even for sessions that didn't write
WAL (which was what started this thread) - we could use page LSN and
flush only to that (so we'd flush once and then it'd be noop until the
sequence increments 32-times and writes another WAL record).
Of course, it's not enough to just flush WAL, we have to wait for the
sync replica too :-(
I don't have any benchmark results quantifying this yet, but I'll do
some tests in the next day or two. But my expectation is this is going
to be pretty expensive, and considering how concerned we were about
affecting current workloads, making the impact worse seems wrong.
My opinion is we should focus on fixing this given the current (weaker)
interpretation of the standard, i.e. accepting the loss of increments
observed only by uncommitted transactions. The page LSN patch seems like
the best way to do that so far.
We may try reworking this to provide the stronger guarantees (i.e. not
losing even increments from uncommitted transactions) in the future, of
course. But considering (a) we're not sure that's really what the SQL
standard requires, (b) no one complained about that in years, and (c)
it's going to make sequences way more expensive, I doubt that's really
desirable.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 27cb6307581..0ab95bd3d8e 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -37,6 +37,7 @@
#include "miscadmin.h"
#include "nodes/makefuncs.h"
#include "parser/parse_type.h"
+#include "replication/syncrep.h"
#include "storage/lmgr.h"
#include "storage/proc.h"
#include "storage/smgr.h"
@@ -812,6 +813,22 @@ nextval_internal(Oid relid, bool check_permissions)
PageSetLSN(page, recptr);
}
+ /*
+ * Make sure the WAL for the sequence is properly flushed, before returning
+ * the nextval result. For sessions that did not generate WAL (and so use a
+ * cached value), this ensures the WAL generated by other sessions reaches
+ * disk. Also, wait for sync replica.
+ *
+ */
+ if (RelationNeedsWAL(seqrel))
+ {
+ XLogRecPtr recptr = PageGetLSN(page);
+
+ XLogFlush(recptr);
+
+ SyncRepWaitForLSN(recptr, true);
+ }
+
/* Now update sequence tuple to the intended final state */
seq->last_value = last; /* last fetched number */
seq->is_called = true;