On 8/1/22 15:33, David Geier wrote:
Hi Tom,
On Wed, Jul 27, 2022 at 7:15 PM Tom Lane <t...@sss.pgh.pa.us> wrote:
I wrote:
> Unfortunately, as things stand today, the planner needs more than the
> right to look at the indexes' schemas, because it makes physical accesses
> to btree indexes to find out their tree height (and I think there are some
> comparable behaviors in other AMs). I've never particularly cared for
> that implementation, and would be glad to rip out that behavior if we can
> find another way. Maybe we could persuade VACUUM or ANALYZE to store that
> info in the index's pg_index row, or some such, and then the planner
> could use it with no lock?
It seems like _bt_getrootheight() first checks if the height is cached
and only accesses the index meta page if it isn't.
If the index locks are only taken for the sake of _bt_getrootheight()
reading the index meta page when it is not yet cached, maybe the
index locks could be taken conditionally.
However, postponing the call to where it is really needed sounds even
better.
> A first step here could just be to postpone fetching _bt_getrootheight()
> until we actually need it during cost estimation. That would avoid the
> need to do it at all for indexes that indxpath.c discards as irrelevant,
> which is a decision made on considerably more information than the
> proposed patch uses.
Hi Tom,
I gave the idea of moving _bt_getrootheight() into costsize.c and
filling IndexOptInfo in get_relation_info() via the syscache instead of
the relcache a try, but didn't get very far.

Moving out _bt_getrootheight() was straightforward, and we should do
that regardless. However, get_relation_info() strongly depends on the
index's Relation for quite a few things. A fair number of fields I could
actually fill from the syscache, but some either need data not stored in
the syscache (e.g. estimate_rel_size() needs Relation::rd_smgr via
RelationGetNumberOfBlocksInFork()) or need fields that are cached in the
index's Relation and would otherwise have to be recomputed (e.g.
Relation::rd_indexprs filled by RelationGetIndexExpressions(),
Relation::rd_indpred filled by RelationGetIndexPredicate()). Even if we
could somehow obtain the missing info from somewhere else, recomputing
the otherwise cached fields would likely cause a significant slowdown in
the serial case.
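
For reference, the attached patch instead fetches the tree height lazily:
btcostestimate() opens the index only if tree_height is still unknown,
roughly like this (same logic as the selfuncs.c hunk below):

    /* In btcostestimate(): fetch the btree height lazily, i.e. only for
     * indexes that actually survive until cost estimation. */
    if (index->tree_height < 0)
    {
        Relation indexRel = index_open(index->indexoid, AccessShareLock);

        index->tree_height = _bt_getrootheight(indexRel);
        index_close(indexRel, NoLock);
    }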
Beyond that I did some off-CPU profiling to precisely track down which
lock serializes execution. It turned out to be the MyProc::fpInfoLock
lightweight lock. This lock is used in the fast path of the heavyweight
lock. In the contended case, fpInfoLock is acquired in LW_EXCLUSIVE
mode to (1) check that no other process holds a stronger lock and, if
so, reserve a process-local fast path lock slot, and (2) give back all
the fast path lock slots in one go at transaction end. In both cases the
current implementation linearly iterates over all lock slots. The
corresponding call stacks are:
Lock acquisition:

  get_relation_info()
    index_open()
      relation_open()
        LockRelationOid()
          LockAcquireExtended()
            LWLockAcquire()

Lock release:

  CommitTransaction()
    ResourceOwnerRelease()
      ResourceOwnerReleaseInternal()
        ProcReleaseLocks()
          LockReleaseAll()  <-- called twice from ProcReleaseLocks()
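
For context, this is roughly the slot lookup that runs under fpInfoLock
today (condensed from the unpatched FastPathGrantRelationLock(), whose
removed lines appear in the diff below): always a full linear scan over
the per-backend slots.

    /* Scan for an existing entry for this relid, remembering an empty slot. */
    for (f = 0; f < FP_LOCK_SLOTS_PER_BACKEND; f++)
    {
        if (FAST_PATH_GET_BITS(MyProc, f) == 0)
            unused_slot = f;                 /* remember an empty slot */
        else if (MyProc->fpRelId[f] == relid)
        {
            FAST_PATH_SET_LOCKMODE(MyProc, f, lockmode);
            return true;                     /* already have a slot for this rel */
        }
    }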
On top of that there are only 16 fast path lock slots. One slot is
always taken up by the parent relation, leaving only 15 slots for the
indexes. As soon as a session runs out of slots, it falls back to the
normal lock path, which has to go through the shared lock table. To do
so it also acquires a lightweight lock in LW_EXCLUSIVE mode. That
lightweight lock, however, is partitioned and therefore does not become
a single point of contention. Hence, normal lock acquisition is slower
but contends less.
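
To illustrate why the partitioned path contends less, here is a minimal
sketch (illustrative only, not the actual lock.c code):

    /* Sketch: the shared lock table is split into NUM_LOCK_PARTITIONS
     * partitions, each protected by its own LWLock.  A backend only takes
     * the LWLock of the partition its lock tag hashes to, so two backends
     * locking different relations usually grab different LWLocks and do
     * not serialize on a single one. */
    #define NUM_LOCK_PARTITIONS 16

    static int
    lock_hash_partition(uint32 hashcode)
    {
        return hashcode % NUM_LOCK_PARTITIONS;
    }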
To confirm the above findings, I increased the number of fast path lock
slots per connection and optimized FastPathGrantRelationLock() and
FastPathUnGrantRelationLock(). With these changes the lock contention
disappeared and the workload scales linearly (the code I tested with
also included moving out _bt_getrootheight()):
| Parallel streams | TPS     | TPS / stream |
|------------------|---------|--------------|
| 1                | 5,253   | 5,253        |
| 10               | 51,406  | 5,140        |
| 20               | 101,401 | 5,070        |
| 30               | 152,023 | 5,067        |
| 40               | 200,607 | 5,015        |
| 50               | 245,359 | 4,907        |
| 60               | 302,994 | 5,049        |
However, with the very same setup, the index filtering approach yields
486k TPS with 60 streams and 9,827 TPS with a single stream. The
single-stream number shows that this is not because it scales even
better, but simply because less work is spent during planning. A quick
perf session showed that a fair amount of time is spent getting the
relation sizes in blocks (RelationGetNumberOfBlocksInFork() ->
lseek64()) and creating index paths (pull_varattnos() ->
bms_add_member(), surprisingly):
    - 32.20%  1.58%  postgres  postgres  [.] get_relation_info
       - 30.62% get_relation_info
          - 16.56% RelationGetNumberOfBlocksInFork
             - 16.42% smgrnblocks
                - 16.25% mdnblocks
                   - 16.10% _mdnblocks
                      + 15.55% __libc_lseek64
          + 5.83% index_open
          + 2.71% estimate_rel_size
            1.56% build_index_tlist
          + 1.22% palloc
       + 1.57% __libc_start_main
    - 23.02%  0.03%  postgres  postgres  [.] make_one_rel
       - 22.99% make_one_rel
          - 22.01% set_base_rel_pathlists
             - 21.99% set_rel_pathlist
                - 21.89% set_plain_rel_pathlist
                   - 21.53% create_index_paths
                      - 18.76% get_index_paths
                         - 18.33% build_index_paths
                            - 15.77% check_index_only
                               - 14.75% pull_varattnos
                                  - 14.58% pull_varattnos_walker
                                     - 13.05% expression_tree_walker
                                        - 9.50% pull_varattnos_walker
                                             5.77% bms_add_member
                                          0.93% bms_add_member
                                          0.52% expression_tree_walker
                                       1.44% pull_varattnos_walker
                            + 1.79% create_index_path
                      + 0.90% match_restriction_clauses_to_index
          + 0.95% set_base_rel_sizes
Given the findings above, the two patches are actually complementary.
Optimizing the lock fast path not only helps when many indexes exist and
only a small subset is used, but whenever a query takes many locks. The
index filtering is another way to reduce lock contention, but beyond
that it also greatly reduces the time spent on planning in the serial
case.
I have attached the patch to improve the heavyweight lock fast path. For
now it also contains the change to move out _bt_getrootheight(). For
workloads where the same set of locks is used over and over again, it
needs on average only a single loop iteration to find the relation
(instead of the linear scan done before). This allows increasing the
number of fast path locks by a lot; in this patch I increased it from 16
to 64. The code can be further improved for cases where the set of
locked relations changes frequently, because then the chance of not
finding a relation, and therefore having to linearly search the whole
array, is higher.
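
The core idea is to start probing at a slot derived from the relation
OID and wrap around, so that a stable working set is usually found in
the first iteration (sketch of the lookup used by the patch in
FastPathGrantRelationLock()/FastPathUnGrantRelationLock()):

    for (i = 0; i < FP_LOCK_SLOTS_PER_BACKEND; i++)
    {
        /* start at a slot derived from the relation OID, wrap around */
        f = ((relid % FP_LOCK_SLOTS_PER_BACKEND) + i) % FP_LOCK_SLOTS_PER_BACKEND;

        if (MyProc->fpRelId[f] == relid)
        {
            FAST_PATH_SET_LOCKMODE(MyProc, f, lockmode);
            return true;            /* usually hit on the first iteration */
        }
        else if (FAST_PATH_GET_BITS(MyProc, f) == 0 &&
                 unused_slot == FP_LOCK_SLOTS_PER_BACKEND)
            unused_slot = f;        /* remember the first empty slot seen */
    }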
I would really appreciate your feedback, Tom, also on the questions
around the approach of filtering out indexes discussed in the previous mails.
--
David Geier
(ServiceNow)
From 0b03fc09570635cd00248a5f4e3ab60e39b917fb Mon Sep 17 00:00:00 2001
From: David Geier <david.ge...@servicenow.com>
Date: Sun, 14 Aug 2022 20:17:54 +0000
Subject: [PATCH] Improve heavyweight lock fast path
---
src/backend/optimizer/util/plancat.c | 9 ----
src/backend/storage/lmgr/lock.c | 71 +++++++++++++++-------------
src/backend/utils/adt/selfuncs.c | 8 ++++
src/include/storage/proc.h | 4 +-
4 files changed, 48 insertions(+), 44 deletions(-)
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 5012bfe142..dfe99ba93b 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -414,16 +414,7 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
info->tuples = rel->tuples;
}
- if (info->relam == BTREE_AM_OID)
- {
- /* For btrees, get tree height while we have the index open */
- info->tree_height = _bt_getrootheight(indexRelation);
- }
- else
- {
- /* For other index types, just set it to "unknown" for now */
info->tree_height = -1;
- }
index_close(indexRelation, NoLock);
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 5f5803f681..b7139f51b5 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -163,14 +163,6 @@ typedef struct TwoPhaseLockRecord
} TwoPhaseLockRecord;
-/*
- * Count of the number of fast path lock slots we believe to be used. This
- * might be higher than the real number if another backend has transferred
- * our locks to the primary lock table, but it can never be lower than the
- * real value, since only we can acquire locks on our own behalf.
- */
-static int FastPathLocalUseCount = 0;
-
/*
* Flag to indicate if the relation extension lock is held by this backend.
* This flag is used to ensure that while holding the relation extension lock
@@ -203,18 +195,18 @@ static bool IsPageLockHeld PG_USED_FOR_ASSERTS_ONLY = false;
#define FAST_PATH_LOCKNUMBER_OFFSET 1
#define FAST_PATH_MASK ((1 << FAST_PATH_BITS_PER_SLOT) - 1)
#define FAST_PATH_GET_BITS(proc, n) \
- (((proc)->fpLockBits >> (FAST_PATH_BITS_PER_SLOT * n)) & FAST_PATH_MASK)
+ ((proc)->fpLockBits[n])
#define FAST_PATH_BIT_POSITION(n, l) \
(AssertMacro((l) >= FAST_PATH_LOCKNUMBER_OFFSET), \
AssertMacro((l) < FAST_PATH_BITS_PER_SLOT+FAST_PATH_LOCKNUMBER_OFFSET), \
AssertMacro((n) < FP_LOCK_SLOTS_PER_BACKEND), \
((l) - FAST_PATH_LOCKNUMBER_OFFSET + FAST_PATH_BITS_PER_SLOT * (n)))
#define FAST_PATH_SET_LOCKMODE(proc, n, l) \
- (proc)->fpLockBits |= UINT64CONST(1) << FAST_PATH_BIT_POSITION(n, l)
+ (proc)->fpLockBits[n] |= 1<<((l)-FAST_PATH_LOCKNUMBER_OFFSET)
#define FAST_PATH_CLEAR_LOCKMODE(proc, n, l) \
- (proc)->fpLockBits &= ~(UINT64CONST(1) << FAST_PATH_BIT_POSITION(n, l))
+ (proc)->fpLockBits[n] &= ~(1 << ((l)-FAST_PATH_LOCKNUMBER_OFFSET))
#define FAST_PATH_CHECK_LOCKMODE(proc, n, l) \
- ((proc)->fpLockBits & (UINT64CONST(1) << FAST_PATH_BIT_POSITION(n, l)))
+ ((proc)->fpLockBits[n] & (1 << ((l)-FAST_PATH_LOCKNUMBER_OFFSET)))
/*
* The fast-path lock mechanism is concerned only with relation locks on
@@ -924,8 +916,7 @@ LockAcquireExtended(const LOCKTAG *locktag,
* lock type on a relation we have already locked using the fast-path, but
* for now we don't worry about that case either.
*/
- if (EligibleForRelationFastPath(locktag, lockmode) &&
- FastPathLocalUseCount < FP_LOCK_SLOTS_PER_BACKEND)
+ if (EligibleForRelationFastPath(locktag, lockmode))
{
uint32 fasthashcode = FastPathStrongLockHashPartition(hashcode);
bool acquired;
@@ -940,8 +931,7 @@ LockAcquireExtended(const LOCKTAG *locktag,
if (FastPathStrongRelationLocks->count[fasthashcode] != 0)
acquired = false;
else
- acquired = FastPathGrantRelationLock(locktag->locktag_field2,
- lockmode);
+ acquired = FastPathGrantRelationLock(locktag->locktag_field2, lockmode);
LWLockRelease(&MyProc->fpInfoLock);
if (acquired)
{
@@ -2076,8 +2066,7 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
locallock->lockCleared = false;
/* Attempt fast release of any lock eligible for the fast path. */
- if (EligibleForRelationFastPath(locktag, lockmode) &&
- FastPathLocalUseCount > 0)
+ if (EligibleForRelationFastPath(locktag, lockmode))
{
bool released;
@@ -2647,6 +2636,11 @@ LockReassignOwner(LOCALLOCK *locallock, ResourceOwner parent)
ResourceOwnerForgetLock(CurrentResourceOwner, locallock);
}
+//static int ItersGrant = 0;
+//static int CallsGrant = 0;
+//static int ItersUngrant = 0;
+//static int CallsUngrant = 0;
+
/*
* FastPathGrantRelationLock
* Grant lock using per-backend fast-path array, if there is space.
@@ -2654,20 +2648,31 @@ LockReassignOwner(LOCALLOCK *locallock, ResourceOwner parent)
static bool
FastPathGrantRelationLock(Oid relid, LOCKMODE lockmode)
{
- uint32 f;
+ uint32 f, i;
uint32 unused_slot = FP_LOCK_SLOTS_PER_BACKEND;
+ // if (CallsGrant % 100000 == 0 || CallsUngrant % 100000 == 0)
+ // elog(WARNING, "counts: %d, %d, %d, %d, %f, %f", CallsGrant, CallsUngrant, ItersGrant, ItersUngrant,
+ // (float)ItersGrant/(float)CallsGrant, (float)ItersUngrant/(float)CallsUngrant);
+
+ // CallsGrant++;
+
/* Scan for existing entry for this relid, remembering empty slot. */
- for (f = 0; f < FP_LOCK_SLOTS_PER_BACKEND; f++)
+ for (i = 0; i < FP_LOCK_SLOTS_PER_BACKEND; i++)
{
- if (FAST_PATH_GET_BITS(MyProc, f) == 0)
- unused_slot = f;
- else if (MyProc->fpRelId[f] == relid)
+ f = ((relid % FP_LOCK_SLOTS_PER_BACKEND) + i) % FP_LOCK_SLOTS_PER_BACKEND;
+ Assert(f < FP_LOCK_SLOTS_PER_BACKEND);
+ // ItersGrant++;
+
+ if (MyProc->fpRelId[f] == relid)
{
Assert(!FAST_PATH_CHECK_LOCKMODE(MyProc, f, lockmode));
FAST_PATH_SET_LOCKMODE(MyProc, f, lockmode);
return true;
}
+ else if (FAST_PATH_GET_BITS(MyProc, f) == 0)
+ if (unused_slot == FP_LOCK_SLOTS_PER_BACKEND)
+ unused_slot = f;
}
/* If no existing entry, use any empty slot. */
@@ -2675,7 +2680,6 @@ FastPathGrantRelationLock(Oid relid, LOCKMODE lockmode)
{
MyProc->fpRelId[unused_slot] = relid;
FAST_PATH_SET_LOCKMODE(MyProc, unused_slot, lockmode);
- ++FastPathLocalUseCount;
return true;
}
@@ -2691,24 +2695,25 @@ FastPathGrantRelationLock(Oid relid, LOCKMODE lockmode)
static bool
FastPathUnGrantRelationLock(Oid relid, LOCKMODE lockmode)
{
- uint32 f;
- bool result = false;
+ uint32 f, i;
- FastPathLocalUseCount = 0;
- for (f = 0; f < FP_LOCK_SLOTS_PER_BACKEND; f++)
+ // CallsUngrant++;
+
+ for (i = 0; i < FP_LOCK_SLOTS_PER_BACKEND; i++)
{
+ f = ((relid % FP_LOCK_SLOTS_PER_BACKEND) + i) % FP_LOCK_SLOTS_PER_BACKEND;
+ // ItersUngrant++;
+
if (MyProc->fpRelId[f] == relid
&& FAST_PATH_CHECK_LOCKMODE(MyProc, f, lockmode))
{
- Assert(!result);
FAST_PATH_CLEAR_LOCKMODE(MyProc, f, lockmode);
- result = true;
- /* we continue iterating so as to update FastPathLocalUseCount */
+ return true;
}
- if (FAST_PATH_GET_BITS(MyProc, f) != 0)
- ++FastPathLocalUseCount;
}
- return result;
+
+ return false;
}
/*
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index d35e5605de..dab1aaf195 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -97,6 +97,7 @@
#include <ctype.h>
#include <math.h>
+#include "access/nbtree.h"
#include "access/brin.h"
#include "access/brin_page.h"
#include "access/gin.h"
@@ -6831,6 +6832,13 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
* touched. The number of such pages is btree tree height plus one (ie,
* we charge for the leaf page too). As above, charge once per SA scan.
*/
+ if (index->tree_height < 0)
+ {
+ Relation indexRel = index_open(index->indexoid, AccessShareLock);
+ index->tree_height = _bt_getrootheight(indexRel);
+ index_close(indexRel, NoLock);
+ }
+
descentCost = (index->tree_height + 1) * 50.0 * cpu_operator_cost;
costs.indexStartupCost += descentCost;
costs.indexTotalCost += costs.num_sa_scans * descentCost;
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 2579e619eb..736b03d4f8 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -80,7 +80,7 @@ struct XidCache
* rather than the main lock table. This eases contention on the lock
* manager LWLocks. See storage/lmgr/README for additional details.
*/
-#define FP_LOCK_SLOTS_PER_BACKEND 16
+#define FP_LOCK_SLOTS_PER_BACKEND 64
/*
* An invalid pgprocno. Must be larger than the maximum number of PGPROC
@@ -282,7 +282,7 @@ struct PGPROC
/* Lock manager data, recording fast-path locks taken by this backend. */
LWLock fpInfoLock; /* protects per-backend fast-path state */
- uint64 fpLockBits; /* lock modes held for each fast-path slot */
+ uint8 fpLockBits[FP_LOCK_SLOTS_PER_BACKEND]; /* lock modes held for each fast-path slot */
Oid fpRelId[FP_LOCK_SLOTS_PER_BACKEND]; /* slots for rel oids */
bool fpVXIDLock; /* are we holding a fast-path VXID lock? */
LocalTransactionId fpLocalTransactionId; /* lxid for fast-path VXID
--
2.25.1