Hi Hackers,
On 9/5/24 01:39, Amit Kapila wrote:
On Wed, Sep 4, 2024 at 6:35 PM Aleksander Alekseev
<aleksan...@timescale.com> wrote:
> > I don't think you can rely on a system clock for conflict resolution.
> > In a corner case a DBA can move the clock forward or backward between
> > recordings of Ts1 and Ts2. On top of that there is no guarantee that
> > 2+ servers have synchronised clocks. It seems to me that what you are
> > proposing will just hide the problem instead of solving it in the
> > general case.
> >
>
> It is possible that we can't rely on the system clock for conflict
> resolution but that is not the specific point of this thread. As
> mentioned in the subject of this thread, the problem is "Commit
> Timestamp and LSN Inversion issue". The LSN value and timestamp for a
> commit are not generated atomically, so two different transactions can
> have them in different order.
Hm.... Then I'm having difficulties understanding why this is a
problem
This is a potential problem pointed out during discussion of CDR [1]
(Please read the point starting from "How is this going to deal .."
and response by Shveta). The point of this thread is that though it
appears to be a problem but practically there is no scenario where it
can impact even when we implement "last_write_wins" startegy as
explained in the initial email. If you or someone sees a problem due
to LSN<->timestamp inversion then we need to explore the solution for
it.
and why it was necessary to mention CDR in this context in the
first place.
OK, let's forget about CDR completely. Who is affected by the current
behavior and why would it be beneficial changing it?
We can't forget CDR completely as this could only be a potential
problem in that context. Right now, we don't have any built-in
resolution strategies, so this can't impact but if this is a problem
then we need to have a solution for it before considering a solution
like "last_write_wins" strategy.
I agree that we can't forget about CDR. This is precisely the problem we
ran into here at pgEdge and why we came up with a solution (attached).
Now, instead of discussing LSN<->timestamp inversion issue, you
started to discuss "last_write_wins" strategy itself which we have
discussed to some extent in the thread [2]. BTW, we are planning to
start a separate thread as well just to discuss the clock skew problem
w.r.t resolution strategies like "last_write_wins" strategy. So, we
can discuss clock skew in that thread and keep the focus of this
thread LSN<->timestamp inversion problem.
Fact is that "last_write_wins" together with some implementation of
Conflict free Replicated Data Types (CRDT) is good enough for many real
world situations. Anything resembling a TPC-B or TPC-C is quite happy
with it.
The attached solution is minimally invasive because it doesn't move the
timestamp generation (clock_gettime() call) into the critical section of
ReserveXLogInsertLocation() that is protected by a spinlock. Instead it
keeps track of the last commit-ts written to WAL in shared memory and
simply bumps that by one microsecond if the next one is below or equal.
There is one extra condition in that code section plus a function call
by pointer for every WAL record. In the unlikely case of encountering a
stalled or backwards running commit-ts, the expensive part of
recalculating the CRC of the altered commit WAL-record is done later,
after releasing the spinlock. I have not been able to measure any
performance impact on a machine with 2x Xeon-Silver (32 HT cores).
This will work fine until we have systems that can sustain a commit rate
of one million transactions per second or higher for more than a few
milliseconds.
Regards, Jan
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 004f7e10e5..c0d2466f41 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -371,6 +371,12 @@ static void ShowTransactionStateRec(const char *str, TransactionState s);
static const char *BlockStateAsString(TBlockState blockState);
static const char *TransStateAsString(TransState state);
+/*
+ * Hook function to be called while holding the WAL insert spinlock.
+ * to adjust commit timestamps via Lamport clock if needed.
+ */
+static void EnsureMonotonicTransactionStopTimestamp(void *data);
+
/* ----------------------------------------------------------------
* transaction state accessors
@@ -2215,6 +2221,13 @@ StartTransaction(void)
if (TransactionTimeout > 0)
enable_timeout_after(TRANSACTION_TIMEOUT, TransactionTimeout);
+ /*
+ * Reset XLogReserveInsertHook (function called while holding
+ * the WAL insert spinlock)
+ */
+ XLogReserveInsertHook = NULL;
+ XLogReserveInsertHookData = NULL;
+
ShowTransactionState("StartTransaction");
}
@@ -5837,6 +5850,7 @@ XactLogCommitRecord(TimestampTz commit_time,
xl_xact_twophase xl_twophase;
xl_xact_origin xl_origin;
uint8 info;
+ XLogRecPtr result;
Assert(CritSectionCount > 0);
@@ -5980,7 +5994,19 @@ XactLogCommitRecord(TimestampTz commit_time,
/* we allow filtering by xacts */
XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
- return XLogInsert(RM_XACT_ID, info);
+ /*
+ * Install our hook for the call to ReserveXLogInsertLocation() so that
+ * we can modify the xactStopTimestamp and the xact_time of the xlrec
+ * while holding the lock that determines the commit-LSN to ensure
+ * the commit timestamps are monotonically increasing.
+ */
+ XLogReserveInsertHook = EnsureMonotonicTransactionStopTimestamp;
+ XLogReserveInsertHookData = (void *)&xlrec;
+ result = XLogInsert(RM_XACT_ID, info);
+ XLogReserveInsertHook = NULL;
+ XLogReserveInsertHookData = NULL;
+
+ return result;
}
/*
@@ -6451,3 +6477,40 @@ xact_redo(XLogReaderState *record)
else
elog(PANIC, "xact_redo: unknown op code %u", info);
}
+
+/*
+ * Hook function used in XactLogCommitRecord() to ensure that the
+ * commit timestamp is monotonically increasing in commit-LSN order.
+ */
+static void
+EnsureMonotonicTransactionStopTimestamp(void *data)
+{
+ xl_xact_commit *xlrec = (xl_xact_commit *)data;
+ TimestampTz logical_clock;
+
+ logical_clock = XLogGetLastTransactionStopTimestamp();
+
+ /*
+ * This is a local transaction. Make sure that the xact_time
+ * higher than any timestamp we have seen thus far.
+ *
+ * TODO: This is not postmaster restart safe. If the local
+ * system clock is further behind other nodes than it takes
+ * for the postmaster to restart (time between it stops
+ * accepting new transactions and time when it becomes ready
+ * to accept new transactions), local transactions will not
+ * be bumped into the future correctly.
+ */
+ if (logical_clock >= xlrec->xact_time)
+ {
+ logical_clock++;
+ xlrec->xact_time = logical_clock;
+ xactStopTimestamp = logical_clock;
+
+ XLogReserveInsertHookModifiedRecord = true;
+ }
+ else
+ logical_clock = xlrec->xact_time;
+
+ XLogSetLastTransactionStopTimestamp(logical_clock);
+}
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f14d3933ae..e1d3122445 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -156,6 +156,16 @@ int wal_segment_size = DEFAULT_XLOG_SEG_SIZE;
*/
int CheckPointSegments;
+/*
+ * Hook to be called inside of ReserveXLogInsertLocation() for
+ * operations that need to be performed while holding the WAL
+ * insert spinlock. Currently this is used to guarantee a monotonically
+ * increasing commit-timestamp in LSN order.
+ */
+XLogReserveInsertHookType XLogReserveInsertHook = NULL;
+void *XLogReserveInsertHookData = NULL;
+bool XLogReserveInsertHookModifiedRecord = false;
+
/* Estimated distance between checkpoints, in bytes */
static double CheckPointDistanceEstimate = 0;
static double PrevCheckPointDistance = 0;
@@ -552,6 +562,14 @@ typedef struct XLogCtlData
XLogRecPtr lastFpwDisableRecPtr;
slock_t info_lck; /* locks shared variables shown above */
+
+ /*
+ * This is our shared, logical clock that we use to force
+ * commit timestamps to be monotonically increasing in
+ * commit-LSN order. This is protected by the Wal-insert
+ * spinlock.
+ */
+ TimestampTz lastTransactionStopTimestamp;
} XLogCtlData;
/*
@@ -701,6 +719,7 @@ static void CopyXLogRecordToWAL(int write_len, bool isLogSwitch,
XLogRecData *rdata,
XLogRecPtr StartPos, XLogRecPtr EndPos,
TimeLineID tli);
+static void XLogRecordCorrectCRC(XLogRecData *rdata);
static void ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos,
XLogRecPtr *EndPos, XLogRecPtr *PrevPtr);
static bool ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos,
@@ -779,6 +798,12 @@ XLogInsertRecord(XLogRecData *rdata,
if (!XLogInsertAllowed())
elog(ERROR, "cannot make new WAL entries during recovery");
+ /*
+ * Make sure the flag telling that ReserveXLog...() modified the
+ * record is false at this point.
+ */
+ XLogReserveInsertHookModifiedRecord = false;
+
/*
* Given that we're not in recovery, InsertTimeLineID is set and can't
* change, so we can read it without a lock.
@@ -907,6 +932,16 @@ XLogInsertRecord(XLogRecData *rdata,
if (inserted)
{
+ /*
+ * If our reserve hook modified the XLog Record,
+ * recalculate the CRC.
+ */
+ if (XLogReserveInsertHookModifiedRecord)
+ {
+ XLogRecordCorrectCRC(rdata);
+ XLogReserveInsertHookModifiedRecord = false;
+ }
+
/*
* Now that xl_prev has been filled in, calculate CRC of the record
* header.
@@ -1087,6 +1122,25 @@ XLogInsertRecord(XLogRecData *rdata,
return EndPos;
}
+/*
+ * Function to recalculate the WAL Record's CRC in case it was
+ * altered during the callback from ReserveXLogInsertLocation().
+ */
+static void
+XLogRecordCorrectCRC(XLogRecData *rdata)
+{
+ XLogRecData *rdt;
+ XLogRecord *rechdr = (XLogRecord *)rdata->data;
+ pg_crc32c rdata_crc;
+
+ INIT_CRC32C(rdata_crc);
+ COMP_CRC32C(rdata_crc, rdata->data + SizeOfXLogRecord, rdata->len - SizeOfXLogRecord);
+ for (rdt = rdata->next; rdt != NULL; rdt = rdt->next)
+ COMP_CRC32C(rdata_crc, rdt->data, rdt->len);
+
+ rechdr->xl_crc = rdata_crc;
+}
+
/*
* Reserves the right amount of space for a record of given size from the WAL.
* *StartPos is set to the beginning of the reserved section, *EndPos to
@@ -1131,6 +1185,12 @@ ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos, XLogRecPtr *EndPos,
*/
SpinLockAcquire(&Insert->insertpos_lck);
+ /*
+ * If set call the XLogReserveInsertHook function
+ */
+ if (XLogReserveInsertHook != NULL)
+ XLogReserveInsertHook(XLogReserveInsertHookData);
+
startbytepos = Insert->CurrBytePos;
endbytepos = startbytepos + size;
prevbytepos = Insert->PrevBytePos;
@@ -1190,6 +1250,12 @@ ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos, XLogRecPtr *PrevPtr)
return false;
}
+ /*
+ * If set call the XLogReserveInsertHook function
+ */
+ if (XLogReserveInsertHook != NULL)
+ XLogReserveInsertHook(XLogReserveInsertHookData);
+
endbytepos = startbytepos + size;
prevbytepos = Insert->PrevBytePos;
@@ -9517,3 +9583,19 @@ SetWalWriterSleeping(bool sleeping)
XLogCtl->WalWriterSleeping = sleeping;
SpinLockRelease(&XLogCtl->info_lck);
}
+
+/*
+ * Get/set the last-transaction-stop-timestamp in shared memory.
+ * Caller must ensure that the WAL-insert spinlock is held.
+ */
+extern TimestampTz
+XLogGetLastTransactionStopTimestamp(void)
+{
+ return XLogCtl->lastTransactionStopTimestamp;
+}
+
+extern void
+XLogSetLastTransactionStopTimestamp(TimestampTz ts)
+{
+ XLogCtl->lastTransactionStopTimestamp = ts;
+}
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 34ad46c067..187d1dc9bc 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -199,6 +199,17 @@ typedef enum WALAvailability
struct XLogRecData;
struct XLogReaderState;
+/*
+ * Hook called from inside of holding the lock that determines
+ * the LSN order of WAL records. We currently use this to ensure that
+ * commit timestamps are monotonically increasing in their LSN
+ * order.
+ */
+typedef void (*XLogReserveInsertHookType)(void *data);
+extern XLogReserveInsertHookType XLogReserveInsertHook;
+extern void *XLogReserveInsertHookData;
+extern bool XLogReserveInsertHookModifiedRecord;
+
extern XLogRecPtr XLogInsertRecord(struct XLogRecData *rdata,
XLogRecPtr fpw_lsn,
uint8 flags,
@@ -270,6 +281,13 @@ extern void SetInstallXLogFileSegmentActive(void);
extern bool IsInstallXLogFileSegmentActive(void);
extern void XLogShutdownWalRcv(void);
+/*
+ * Functions to access the last commit Lamport timestamp held in
+ * XLogCtl.
+ */
+extern TimestampTz XLogGetLastTransactionStopTimestamp(void);
+extern void XLogSetLastTransactionStopTimestamp(TimestampTz tz);
+
/*
* Routines to start, stop, and get status of a base backup.
*/