08.03.2018 03:42, Tomas Vondra wrote:
> On 03/06/2018 06:23 AM, Yura Sokolov wrote:
>> 05.03.2018 18:00, Tom Lane wrote:
>>> Tomas Vondra <tomas.von...@2ndquadrant.com> writes:
>>>> Snapshots are static (we don't really add new XIDs into existing ones,
>>>> right?), so why don't we simply sort the XIDs and then use bsearch to
>>>> lookup values? That should fix the linear search, without need for any
>>>> local hash table.
>>>
>>> +1 for looking into that, since it would avoid adding any complication
>>> to snapshot copying.  In principle, with enough XIDs in the snap, an
>>> O(1) hash probe would be quicker than an O(log N) bsearch ... but I'm
>>> dubious that we are often in the range where that would matter.
>>> We do need to worry about the cost of snapshot copying, too.
>>
>> Sorting is the first thing I tried. Unfortunately, the sorting
>> itself eats too much CPU. Filling the hash table is much faster.
>>
> 
> I've been interested in the sort-based approach, so I've spent a bit of
> time hacking on it (patch attached). It's much less invasive than the
> hash table, but Yura is right that the hash table gives better results.
> 
> I've tried to measure the benefits using the same script I shared on
> Tuesday, but I kept getting really strange numbers. In fact, I've been
> unable to even reproduce the results I shared back then. And after a bit
> of head-scratching I think the script is useless - it can't possibly
> generate snapshots with many XIDs because all the update threads sleep
> for exactly the same time. And the first one to sleep is the first one to
> wake up and commit, so most of the other XIDs are above xmax (and thus not
> included in the snapshot). I have no idea how I got those earlier numbers :-/
> 
> But with this insight, I quickly modified the script to also consume
> XIDs by another thread (which simply calls txid_current). With that I'm
> getting snapshots with as many XIDs as there are UPDATE clients (this
> time I actually checked that using gdb).
> 
> And for a 60-second run the tps results look like this (see the attached
> chart, showing the same data):
> 
>     writers     master      hash       sort
>    -----------------------------------------
>     16           1068       1057       1070
>     32           1005        995       1033
>     64           1064       1042       1117
>     128          1058       1021       1051
>     256           977       1004        928
>     512           773        935        808
>     768           576        815        670
>     1000          413        752        573
> 
> The sort certainly does improve performance compared to master, but it's
> only about half of the hashtable improvement.
> 
> I don't know how much this simple workload resembles the YCSB benchmark,
> but I seem to recall it touches individual rows. In that case the sort is
> likely to do even worse, because pg_qsort is more expensive than building
> the hash table.
> 
> So I agree with Yura that the sort is not a viable alternative to the hash
> table in this case.
> 
>> Can I rely on snapshots being static? Maybe then there is no need
>> for a separate raw representation and hash table. I may try to fill the
>> hash table directly in GetSnapshotData instead of filling it lazily.
>> Though that will increase the time spent under the lock, so it is
>> debatable and should be carefully measured.
>>
> 
> Yes, I think you can rely on snapshots not being modified later. That's
> pretty much the definition of a snapshot.
> 
> You may do that in GetSnapshotData, but you certainly can't do that in
> the for loop there. Not only do we not want to add more work there, but
> you don't know the number of XIDs inside that loop. And we don't want to
> build hash tables for a small number of XIDs.
> 
> One reason against building the hash table in GetSnapshotData is that
> we'd build it even when the snapshot is never queried, or when it is
> queried but we only need to check xmin/xmax.

Thank you for the analysis, Tomas.

Stephen is right about the bug in snapmgr.c.
The attached version fixes the bug and also simplifies XidInXip a bit.
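
To make the review a bit easier, here is the idea behind XidInXip in
isolation: the hash table lives in the spare slots allocated right after
the xip array, uses open addressing with linear probing, and treats
InvalidTransactionId (0) as the empty-slot marker (safe, since no normal
xid is ever 0). The program below is only a simplified standalone sketch
with hypothetical names (xid_in_xip and friends), not the patch code
itself:

/* Simplified sketch of the on-demand hash used by XidInXip (hypothetical names). */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;
#define InvalidTransactionId ((TransactionId) 0)

/*
 * Look up xid among the xcnt entries of xip.  The hash part (hashsize slots,
 * a power of two, at least twice xcnt) is built lazily on the first call.
 */
static bool
xid_in_xip(TransactionId xid, const TransactionId *xip, uint32_t xcnt,
           TransactionId *hash, uint32_t hashsize, bool *built)
{
	uint32_t	mask = hashsize - 1;
	uint32_t	pos;
	uint32_t	i;

	if (!*built)
	{
		bool		found = false;

		/* Build: insert every xip entry, probing linearly on collisions. */
		for (i = 0; i < xcnt; i++)
		{
			found |= (xip[i] == xid);
			pos = xip[i] & mask;
			while (hash[pos] != InvalidTransactionId)
				pos = (pos + 1) & mask;
			hash[pos] = xip[i];
		}
		*built = true;
		return found;		/* answer the query that triggered the build */
	}

	/* Lookup: probe until we hit xid or an empty slot. */
	pos = xid & mask;
	while (hash[pos] != InvalidTransactionId)
	{
		if (hash[pos] == xid)
			return true;
		pos = (pos + 1) & mask;
	}
	return false;
}

int
main(void)
{
	/* 228 collides with 100 and 357 with 101 modulo 128, exercising the probing. */
	TransactionId	xip[] = {100, 101, 228, 357};
	TransactionId	hash[128] = {0};	/* power of two, >= 2x the array size */
	bool			built = false;

	printf("101 in snapshot: %d\n", xid_in_xip(101, xip, 4, hash, 128, &built));
	printf("102 in snapshot: %d\n", xid_in_xip(102, xip, 4, hash, 128, &built));
	return 0;
}

The build loop also answers the query that triggered it, so the first lookup
pays only for the single pass that fills the table; later lookups stop at the
first empty slot they hit.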

With regards,
Sokolov Yura.
From 8484ae3f2c2f8af20ae8ce2f6d7960b6519e65c0 Mon Sep 17 00:00:00 2001
From: Sokolov Yura aka funny_falcon <funny.fal...@gmail.com>
Date: Fri, 9 Mar 2018 22:49:01 +0300
Subject: [PATCH] Make hash table for xip in XidInMVCCSnapshot

When many concurrent transactions attempt to update a single row,
the linear scan through the running-transactions list in
XidInMVCCSnapshot becomes a noticeable bottleneck.

With this change, a hash table is built on the first search for an
xid in the snapshot->xip and snapshot->subxip arrays.

If the array holds fewer than 60 entries, a linear scan is still
used, since building the hash table brings little benefit for such
small arrays (at least on Intel Xeon).
---
 src/backend/storage/ipc/procarray.c | 67 ++++++++++++++++++++++------
 src/backend/utils/time/snapmgr.c    | 25 +++++++----
 src/backend/utils/time/tqual.c      | 88 ++++++++++++++++++++++++++++---------
 src/include/utils/snapshot.h        | 11 +++++
 4 files changed, 148 insertions(+), 43 deletions(-)

diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index afe1c03aa3..2020b7482a 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -1469,6 +1469,54 @@ GetMaxSnapshotSubxidCount(void)
 	return TOTAL_MAX_CACHED_SUBXIDS;
 }
 
+/*
+ * ExtendXipSizeForHash - calculate xip array size with space for a hash table.
+ *
+ * The hash table should be at least twice as large as the array, so that
+ * lookups do not depend on the cleverness of the hash function.
+ *
+ * If xipcnt < SnapshotMinHash, no hash table is needed at all.
+ */
+int
+ExtendXipSizeForHash(int xipcnt, uint8* plog)
+{
+	int size;
+	uint8 log = 0;
+
+	size = xipcnt;
+	if (xipcnt >= SnapshotMinHash)
+	{
+		log = 1;
+		while (xipcnt) {
+			log++;
+			xipcnt >>= 1;
+		}
+		size += 1<<log;
+	}
+	*plog = log;
+	return size;
+}
+
+/*
+ * AllocateXip - allocate xip array, extended with a hash part if needed.
+ *
+ * The hash part is used by XidInXip (called from XidInMVCCSnapshot) in tqual.c.
+ */
+static TransactionId *
+AllocateXip(int max, uint8* plog)
+{
+	int size;
+	TransactionId *xip;
+
+	size = ExtendXipSizeForHash(max, plog);
+	xip = (TransactionId *) malloc(size * sizeof(TransactionId));
+	if (xip == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OUT_OF_MEMORY),
+				 errmsg("out of memory")));
+	return xip;
+}
+
 /*
  * GetSnapshotData -- returns information about running transactions.
  *
@@ -1538,19 +1586,10 @@ GetSnapshotData(Snapshot snapshot)
 		 * First call for this snapshot. Snapshot is same size whether or not
 		 * we are in recovery, see later comments.
 		 */
-		snapshot->xip = (TransactionId *)
-			malloc(GetMaxSnapshotXidCount() * sizeof(TransactionId));
-		if (snapshot->xip == NULL)
-			ereport(ERROR,
-					(errcode(ERRCODE_OUT_OF_MEMORY),
-					 errmsg("out of memory")));
-		Assert(snapshot->subxip == NULL);
-		snapshot->subxip = (TransactionId *)
-			malloc(GetMaxSnapshotSubxidCount() * sizeof(TransactionId));
-		if (snapshot->subxip == NULL)
-			ereport(ERROR,
-					(errcode(ERRCODE_OUT_OF_MEMORY),
-					 errmsg("out of memory")));
+		snapshot->xip = AllocateXip(GetMaxSnapshotXidCount(),
+				&snapshot->xhlog);
+		snapshot->subxip = AllocateXip(GetMaxSnapshotSubxidCount(),
+				&snapshot->subxhlog);
 	}
 
 	/*
@@ -1758,6 +1797,8 @@ GetSnapshotData(Snapshot snapshot)
 	snapshot->active_count = 0;
 	snapshot->regd_count = 0;
 	snapshot->copied = false;
+	snapshot->xhlog &= ~SnapshotHashBuilt;
+	snapshot->subxhlog &= ~SnapshotHashBuilt;
 
 	if (old_snapshot_threshold < 0)
 	{
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index e58c69dbd7..f8212bdd60 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -662,14 +662,16 @@ CopySnapshot(Snapshot snapshot)
 	Snapshot	newsnap;
 	Size		subxipoff;
 	Size		size;
+	int			xcnt, subxcnt;
+	uint8		xhlog, subxhlog;
 
 	Assert(snapshot != InvalidSnapshot);
 
+	xcnt = ExtendXipSizeForHash(snapshot->xcnt, &xhlog);
+	subxcnt = ExtendXipSizeForHash(snapshot->subxcnt, &subxhlog);
 	/* We allocate any XID arrays needed in the same palloc block. */
-	size = subxipoff = sizeof(SnapshotData) +
-		snapshot->xcnt * sizeof(TransactionId);
-	if (snapshot->subxcnt > 0)
-		size += snapshot->subxcnt * sizeof(TransactionId);
+	size = subxipoff = sizeof(SnapshotData) + xcnt * sizeof(TransactionId);
+	size += subxcnt * sizeof(TransactionId);
 
 	newsnap = (Snapshot) MemoryContextAlloc(TopTransactionContext, size);
 	memcpy(newsnap, snapshot, sizeof(SnapshotData));
@@ -677,6 +679,8 @@ CopySnapshot(Snapshot snapshot)
 	newsnap->regd_count = 0;
 	newsnap->active_count = 0;
 	newsnap->copied = true;
+	newsnap->xhlog = xhlog;
+	newsnap->subxhlog = subxhlog;
 
 	/* setup XID array */
 	if (snapshot->xcnt > 0)
@@ -2130,16 +2134,18 @@ RestoreSnapshot(char *start_address)
 	Size		size;
 	Snapshot	snapshot;
 	TransactionId *serialized_xids;
+	int			xcnt, subxcnt;
+	uint8		xhlog, subxhlog;
 
 	memcpy(&serialized_snapshot, start_address,
 		   sizeof(SerializedSnapshotData));
 	serialized_xids = (TransactionId *)
 		(start_address + sizeof(SerializedSnapshotData));
 
+	xcnt = ExtendXipSizeForHash(serialized_snapshot.xcnt, &xhlog);
+	subxcnt = ExtendXipSizeForHash(serialized_snapshot.subxcnt, &subxhlog);
 	/* We allocate any XID arrays needed in the same palloc block. */
-	size = sizeof(SnapshotData)
-		+ serialized_snapshot.xcnt * sizeof(TransactionId)
-		+ serialized_snapshot.subxcnt * sizeof(TransactionId);
+	size = sizeof(SnapshotData) + (xcnt + subxcnt) * sizeof(TransactionId);
 
 	/* Copy all required fields */
 	snapshot = (Snapshot) MemoryContextAlloc(TopTransactionContext, size);
@@ -2148,8 +2154,10 @@ RestoreSnapshot(char *start_address)
 	snapshot->xmax = serialized_snapshot.xmax;
 	snapshot->xip = NULL;
 	snapshot->xcnt = serialized_snapshot.xcnt;
+	snapshot->xhlog = xhlog;
 	snapshot->subxip = NULL;
 	snapshot->subxcnt = serialized_snapshot.subxcnt;
+	snapshot->subxhlog = subxhlog;
 	snapshot->suboverflowed = serialized_snapshot.suboverflowed;
 	snapshot->takenDuringRecovery = serialized_snapshot.takenDuringRecovery;
 	snapshot->curcid = serialized_snapshot.curcid;
@@ -2167,8 +2175,7 @@ RestoreSnapshot(char *start_address)
 	/* Copy SubXIDs, if present. */
 	if (serialized_snapshot.subxcnt > 0)
 	{
-		snapshot->subxip = ((TransactionId *) (snapshot + 1)) +
-			serialized_snapshot.xcnt;
+		snapshot->subxip = ((TransactionId *) (snapshot + 1)) + xcnt;
 		memcpy(snapshot->subxip, serialized_xids + serialized_snapshot.xcnt,
 			   serialized_snapshot.subxcnt * sizeof(TransactionId));
 	}
diff --git a/src/backend/utils/time/tqual.c b/src/backend/utils/time/tqual.c
index f7c4c9188c..3fbd9327b3 100644
--- a/src/backend/utils/time/tqual.c
+++ b/src/backend/utils/time/tqual.c
@@ -1461,6 +1461,66 @@ HeapTupleIsSurelyDead(HeapTuple htup, TransactionId OldestXmin)
 	return TransactionIdPrecedes(HeapTupleHeaderGetRawXmax(tuple), OldestXmin);
 }
 
+/*
+ * XidInXip searches for xid in the xip array.
+ *
+ * If *xhlog is zero (the array was sized for fewer than SnapshotMinHash
+ * entries), a simple linear scan of the xip array is used; there appears
+ * to be no benefit from a hash table for small xip arrays.
+ *
+ * Otherwise, space for a hash table is allocated right after the xip array;
+ * the low bits of xhlog hold the logarithm of its size, and the high bit
+ * (SnapshotHashBuilt) is set once the hash table has been built.
+ *
+ * The hash table uses simple linear probing, which keeps both the
+ * insertion and the lookup loops small.
+ */
+static inline bool
+XidInXip(TransactionId xid, TransactionId *xip, uint32 xcnt, uint8 *xhlog)
+{
+	TransactionId *xiphash;
+	uint32 i, pos, mask;
+	int found = 0;
+
+	if (*xhlog == 0)
+	{
+		/* full scan for small snapshots and if xiphash is not allocated */
+		for (i = 0; i < xcnt; i++)
+			if (TransactionIdEquals(xid, xip[i]))
+				return true;
+		return false;
+	}
+
+	mask = ((uint32)1 << (*xhlog & ~SnapshotHashBuilt)) - 1;
+	xiphash = xip + xcnt;
+
+	if (*xhlog & SnapshotHashBuilt)
+	{
+		/* Hash table already built. Do lookup */
+		pos = xid & mask;
+		while (xiphash[pos] != InvalidTransactionId)
+		{
+			if (TransactionIdEquals(xiphash[pos], xid))
+				return true;
+			pos = (pos + 1) & mask;
+		}
+		return false;
+	}
+
+	/* Build hash table. */
+	*xhlog |= SnapshotHashBuilt;
+	memset(xiphash, 0, (mask+1) * sizeof(TransactionId));
+	for (i = 0; i < xcnt; i++)
+	{
+		found |= TransactionIdEquals(xip[i], xid);
+		pos = xip[i] & mask;
+		while (xiphash[pos] != InvalidTransactionId)
+			pos = (pos + 1) & mask;
+		xiphash[pos] = xip[i];
+	}
+	return found != 0;
+}
+
 /*
  * XidInMVCCSnapshot
  *		Is the given XID still-in-progress according to the snapshot?
@@ -1474,8 +1534,6 @@ HeapTupleIsSurelyDead(HeapTuple htup, TransactionId OldestXmin)
 bool
 XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
 {
-	uint32		i;
-
 	/*
 	 * Make a quick range check to eliminate most XIDs without looking at the
 	 * xip arrays.  Note that this is OK even if we convert a subxact XID to
@@ -1507,13 +1565,8 @@ XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
 		if (!snapshot->suboverflowed)
 		{
 			/* we have full data, so search subxip */
-			int32		j;
-
-			for (j = 0; j < snapshot->subxcnt; j++)
-			{
-				if (TransactionIdEquals(xid, snapshot->subxip[j]))
-					return true;
-			}
+			if (XidInXip(xid, snapshot->subxip, snapshot->subxcnt, &snapshot->subxhlog))
+				return true;
 
 			/* not there, fall through to search xip[] */
 		}
@@ -1534,16 +1587,12 @@ XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
 				return false;
 		}
 
-		for (i = 0; i < snapshot->xcnt; i++)
-		{
-			if (TransactionIdEquals(xid, snapshot->xip[i]))
-				return true;
-		}
+
+		if (XidInXip(xid, snapshot->xip, snapshot->xcnt, &snapshot->xhlog))
+			return true;
 	}
 	else
 	{
-		int32		j;
-
 		/*
 		 * In recovery we store all xids in the subxact array because it is by
 		 * far the bigger array, and we mostly don't know which xids are
@@ -1573,11 +1622,8 @@ XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
 		 * indeterminate xid. We don't know whether it's top level or subxact
 		 * but it doesn't matter. If it's present, the xid is visible.
 		 */
-		for (j = 0; j < snapshot->subxcnt; j++)
-		{
-			if (TransactionIdEquals(xid, snapshot->subxip[j]))
-				return true;
-		}
+		if (XidInXip(xid, snapshot->subxip, snapshot->subxcnt, &snapshot->subxhlog))
+			return true;
 	}
 
 	return false;
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index a8a5a8f4c0..332561e82b 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -23,6 +23,15 @@
 typedef struct SnapshotData *Snapshot;
 
 #define InvalidSnapshot		((Snapshot) NULL)
+/*
+ * A large running-xids list is converted into a hash table on demand.
+ * SnapshotMinHash is the threshold between "linear scan" and "hash on demand".
+ * SnapshotHashBuilt indicates that the hash table has already been built.
+ */
+#define SnapshotMinHash			(60)
+#define SnapshotHashBuilt		((uint8)0x80)
+/* Calculate size to hold both array and hash table (if needed) */
+int ExtendXipSizeForHash(int xipcnt, uint8* plog);
 
 /*
  * We use SnapshotData structures to represent both "regular" (MVCC)
@@ -78,6 +87,7 @@ typedef struct SnapshotData
 	 */
 	TransactionId *xip;
 	uint32		xcnt;			/* # of xact ids in xip[] */
+	uint8		xhlog;			/* log2 of allocated xip hash part */
 
 	/*
 	 * For non-historic MVCC snapshots, this contains subxact IDs that are in
@@ -90,6 +100,7 @@ typedef struct SnapshotData
 	 */
 	TransactionId *subxip;
 	int32		subxcnt;		/* # of xact ids in subxip[] */
+	uint8		subxhlog;		/* log2 of allocated subxip hash part */
 	bool		suboverflowed;	/* has the subxip array overflowed? */
 
 	bool		takenDuringRecovery;	/* recovery-shaped snapshot? */
-- 
2.14.1
