On 8/1/22 15:33, David Geier wrote:
Hi Tom,
On Wed, Jul 27, 2022 at 7:15 PM Tom Lane <t...@sss.pgh.pa.us> wrote:
I wrote:
> Unfortunately, as things stand today, the planner needs more than the
> right to look at the indexes' schemas, because it makes physical accesses
> to btree indexes to find out their tree height (and I think there are some
> comparable behaviors in other AMs). I've never particularly cared for
> that implementation, and would be glad to rip out that behavior if we can
> find another way. Maybe we could persuade VACUUM or ANALYZE to store that
> info in the index's pg_index row, or some such, and then the planner
> could use it with no lock?
It seems like _bt_getrootheight() first checks if the height is cached
and only accesses the index meta page if it isn't.
If the index locks are only taken for the sake of _bt_getrootheight()
reading the index meta page when it is not yet cached, maybe the
index locks could be taken conditionally.
However, postponing the call to where it is really needed sounds even
better.
> A first step here could just be to postpone fetching _bt_getrootheight()
> until we actually need it during cost estimation. That would avoid the
> need to do it at all for indexes that indxpath.c discards as irrelevant,
> which is a decision made on considerably more information than the
> proposed patch uses.
Hi Tom,
I gave the idea of moving _bt_getrootheight() into costsize.c and
filling IndexOptInfo in get_relation_info() via the syscache instead of
the relcache a try, but didn't get very far.

Moving out _bt_getrootheight() was straightforward, and we should do
that regardless. However, get_relation_info() strongly depends on the
index's Relation for quite a few things. A fair number of fields I could
actually fill from the syscache, but some either need data not stored in
the syscache (e.g. estimate_rel_size() needs Relation::rd_smgr via
RelationGetNumberOfBlocksInFork()) or need fields that are cached in the
index's Relation and would otherwise have to be recomputed (e.g.
Relation::rd_indexprs filled by RelationGetIndexExpressions(),
Relation::rd_indpred filled by RelationGetIndexPredicate()). Even if we
could somehow obtain the missing info from somewhere else, recomputing
the otherwise cached fields would likely cause a significant slowdown in
the serial case.
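
For reference, the attached patch instead fetches the tree height lazily:
btcostestimate() opens the index only if tree_height is still unknown,
roughly like this (same logic as the selfuncs.c hunk below):

    /* In btcostestimate(): fetch the btree height lazily, i.e. only for
     * indexes that actually survive until cost estimation. */
    if (index->tree_height < 0)
    {
        Relation indexRel = index_open(index->indexoid, AccessShareLock);

        index->tree_height = _bt_getrootheight(indexRel);
        index_close(indexRel, NoLock);
    }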
Beyond that I did some off-CPU profiling to precisely track down which
lock serializes execution. It turned out to be the MyProc::fpInfoLock
lightweight lock. This lock is used in the fast path of the heavyweight
lock. In the contended case, fpInfoLock is acquired in LW_EXCLUSIVE
mode to (1) check that no other process holds a stronger lock and, if
so, reserve a process-local fast path lock slot, and (2) give back all
the fast path lock slots in one go at transaction end. In both cases the
current implementation linearly iterates over all lock slots. The
corresponding call stacks are:
Lock acquisition:

  get_relation_info()
    index_open()
      relation_open()
        LockRelationOid()
          LockAcquireExtended()
            LWLockAcquire()

Lock release:

  CommitTransaction()
    ResourceOwnerRelease()
      ResourceOwnerReleaseInternal()
        ProcReleaseLocks()
          LockReleaseAll()  <-- called twice from ProcReleaseLocks()
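
For context, this is roughly the slot lookup that runs under fpInfoLock
today (condensed from the unpatched FastPathGrantRelationLock(), whose
removed lines appear in the diff below): always a full linear scan over
the per-backend slots.

    /* Scan for an existing entry for this relid, remembering an empty slot. */
    for (f = 0; f < FP_LOCK_SLOTS_PER_BACKEND; f++)
    {
        if (FAST_PATH_GET_BITS(MyProc, f) == 0)
            unused_slot = f;                 /* remember an empty slot */
        else if (MyProc->fpRelId[f] == relid)
        {
            FAST_PATH_SET_LOCKMODE(MyProc, f, lockmode);
            return true;                     /* already have a slot for this rel */
        }
    }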
On top of that there are only 16 fast path lock slots. One slot is
always taken up by the parent relation, leaving only 15 slots for the
indexes. As soon as a session runs out of slots, it falls back to the
normal lock path, which has to go through the shared lock table. To do
so it also acquires a lightweight lock in LW_EXCLUSIVE mode. That
lightweight lock, however, is partitioned and therefore does not become
a single point of contention. Hence, normal lock acquisition is slower
but contends less.
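
To illustrate why the partitioned path contends less, here is a minimal
sketch (illustrative only, not the actual lock.c code):

    /* Sketch: the shared lock table is split into NUM_LOCK_PARTITIONS
     * partitions, each protected by its own LWLock.  A backend only takes
     * the LWLock of the partition its lock tag hashes to, so two backends
     * locking different relations usually grab different LWLocks and do
     * not serialize on a single one. */
    #define NUM_LOCK_PARTITIONS 16

    static int
    lock_hash_partition(uint32 hashcode)
    {
        return hashcode % NUM_LOCK_PARTITIONS;
    }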
To confirm the above findings, I increased the number of fast path lock
slots per connection and optimized FastPathGrantRelationLock() and
FastPathUnGrantRelationLock(). With these changes the lock contention
disappeared and the workload scales linearly (the code I tested with
also included moving out _bt_getrootheight()):
| Parallel streams | TPS     | TPS / stream |
|------------------|---------|--------------|
| 1                | 5,253   | 5,253        |
| 10               | 51,406  | 5,140        |
| 20               | 101,401 | 5,070        |
| 30               | 152,023 | 5,067        |
| 40               | 200,607 | 5,015        |
| 50               | 245,359 | 4,907        |
| 60               | 302,994 | 5,049        |
However, with the very same setup, the index filtering approach yields
486k TPS with 60 streams and 9,827 TPS with a single stream. The
single-stream number shows that this is not because it scales even
better, but simply because less work is spent during planning. A quick
perf session showed that a fair amount of time is spent getting the
relation sizes in blocks (RelationGetNumberOfBlocksInFork() ->
lseek64()) and creating index paths (pull_varattnos() ->
bms_add_member(), surprisingly):
    - 32.20%  1.58%  postgres  postgres  [.] get_relation_info
       - 30.62% get_relation_info
          - 16.56% RelationGetNumberOfBlocksInFork
             - 16.42% smgrnblocks
                - 16.25% mdnblocks
                   - 16.10% _mdnblocks
                      + 15.55% __libc_lseek64
          + 5.83% index_open
          + 2.71% estimate_rel_size
            1.56% build_index_tlist
          + 1.22% palloc
       + 1.57% __libc_start_main
    - 23.02%  0.03%  postgres  postgres  [.] make_one_rel
       - 22.99% make_one_rel
          - 22.01% set_base_rel_pathlists
             - 21.99% set_rel_pathlist
                - 21.89% set_plain_rel_pathlist
                   - 21.53% create_index_paths
                      - 18.76% get_index_paths
                         - 18.33% build_index_paths
                            - 15.77% check_index_only
                               - 14.75% pull_varattnos
                                  - 14.58% pull_varattnos_walker
                                     - 13.05% expression_tree_walker
                                        - 9.50% pull_varattnos_walker
                                             5.77% bms_add_member
                                          0.93% bms_add_member
                                          0.52% expression_tree_walker
                                       1.44% pull_varattnos_walker
                            + 1.79% create_index_path
                      + 0.90% match_restriction_clauses_to_index
          + 0.95% set_base_rel_sizes
Given the findings above, the two patches are actually complementary.
Optimizing the lock fast path not only helps when many indexes exist and
only a small subset is used, but whenever a query takes many locks. The
index filtering is another way to reduce lock contention, but beyond
that it also greatly reduces the time spent on planning in the serial
case.
I have attached the patch to improve the heavyweight lock fast path. For
now it also contains the change to move out _bt_getrootheight(). For
workloads where the same set of locks is used over and over again, it
needs on average only a single loop iteration to find the relation
(instead of the linear scan done before). This allows increasing the
number of fast path locks by a lot; in this patch I increased it from 16
to 64. The code can be further improved for cases where the set of
locked relations changes frequently, because then the chance of not
finding a relation, and therefore having to linearly search the whole
array, is higher.
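
The core idea is to start probing at a slot derived from the relation
OID and wrap around, so that a stable working set is usually found in
the first iteration (sketch of the lookup used by the patch in
FastPathGrantRelationLock()/FastPathUnGrantRelationLock()):

    for (i = 0; i < FP_LOCK_SLOTS_PER_BACKEND; i++)
    {
        /* start at a slot derived from the relation OID, wrap around */
        f = ((relid % FP_LOCK_SLOTS_PER_BACKEND) + i) % FP_LOCK_SLOTS_PER_BACKEND;

        if (MyProc->fpRelId[f] == relid)
        {
            FAST_PATH_SET_LOCKMODE(MyProc, f, lockmode);
            return true;            /* usually hit on the first iteration */
        }
        else if (FAST_PATH_GET_BITS(MyProc, f) == 0 &&
                 unused_slot == FP_LOCK_SLOTS_PER_BACKEND)
            unused_slot = f;        /* remember the first empty slot seen */
    }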
I would really appreciate your feedback, Tom, also on the questions
around the approach of filtering out indexes discussed in the previous mails.
--
David Geier
(ServiceNow)
From 0b03fc09570635cd00248a5f4e3ab60e39b917fb Mon Sep 17 00:00:00 2001
From: David Geier <david.ge...@servicenow.com>
Date: Sun, 14 Aug 2022 20:17:54 +0000
Subject: [PATCH] Improve heavyweight lock fast path
---
src/backend/optimizer/util/plancat.c | 9 ----
src/backend/storage/lmgr/lock.c | 71 +++++++++++++++-------------
src/backend/utils/adt/selfuncs.c | 8 ++++
src/include/storage/proc.h | 4 +-
4 files changed, 48 insertions(+), 44 deletions(-)
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 5012bfe142..dfe99ba93b 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -414,16 +414,7 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
info->tuples = rel->tuples;
}
- if (info->relam == BTREE_AM_OID)
- {
- /* For btrees, get tree height while we have the index open */
- info->tree_height = _bt_getrootheight(indexRelation);
- }
- else
- {
- /* For other index types, just set it to "unknown" for now */
info->tree_height = -1;
- }
index_close(indexRelation, NoLock);
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 5f5803f681..b7139f51b5 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -163,14 +163,6 @@ typedef struct TwoPhaseLockRecord
} TwoPhaseLockRecord;
-/*
- * Count of the number of fast path lock slots we believe to be used. This
- * might be higher than the real number if another backend has transferred
- * our locks to the primary lock table, but it can never be lower than the
- * real value, since only we can acquire locks on our own behalf.
- */
-static int FastPathLocalUseCount = 0;
-
/*
* Flag to indicate if the relation extension lock is held by this backend.
* This flag is used to ensure that while holding the relation extension lock
@@ -203,18 +195,18 @@ static bool IsPageLockHeld PG_USED_FOR_ASSERTS_ONLY = false;
#define FAST_PATH_LOCKNUMBER_OFFSET 1
#define FAST_PATH_MASK ((1 << FAST_PATH_BITS_PER_SLOT) - 1)
#define FAST_PATH_GET_BITS(proc, n) \
- (((proc)->fpLockBits >> (FAST_PATH_BITS_PER_SLOT * n)) & FAST_PATH_MASK)
+ ((proc)->fpLockBits[n])
#define FAST_PATH_BIT_POSITION(n, l) \
(AssertMacro((l) >= FAST_PATH_LOCKNUMBER_OFFSET), \
AssertMacro((l) < FAST_PATH_BITS_PER_SLOT+FAST_PATH_LOCKNUMBER_OFFSET), \
AssertMacro((n) < FP_LOCK_SLOTS_PER_BACKEND), \
((l) - FAST_PATH_LOCKNUMBER_OFFSET + FAST_PATH_BITS_PER_SLOT * (n)))
#define FAST_PATH_SET_LOCKMODE(proc, n, l) \
- (proc)->fpLockBits |= UINT64CONST(1) << FAST_PATH_BIT_POSITION(n, l)
+ (proc)->fpLockBits[n] |= 1<<((l)-FAST_PATH_LOCKNUMBER_OFFSET)
#define FAST_PATH_CLEAR_LOCKMODE(proc, n, l) \
- (proc)->fpLockBits &= ~(UINT64CONST(1) << FAST_PATH_BIT_POSITION(n, l))
+ (proc)->fpLockBits[n] &= ~(1 << ((l)-FAST_PATH_LOCKNUMBER_OFFSET))
#define FAST_PATH_CHECK_LOCKMODE(proc, n, l) \
- ((proc)->fpLockBits & (UINT64CONST(1) << FAST_PATH_BIT_POSITION(n, l)))
+ ((proc)->fpLockBits[n] & (1 << ((l)-FAST_PATH_LOCKNUMBER_OFFSET)))
/*
* The fast-path lock mechanism is concerned only with relation locks on
@@ -924,8 +916,7 @@ LockAcquireExtended(const LOCKTAG *locktag,
* lock type on a relation we have already locked using the fast-path, but
* for now we don't worry about that case either.
*/
- if (EligibleForRelationFastPath(locktag, lockmode) &&
- FastPathLocalUseCount < FP_LOCK_SLOTS_PER_BACKEND)
+ if (EligibleForRelationFastPath(locktag, lockmode))
{
uint32 fasthashcode = FastPathStrongLockHashPartition(hashcode);
bool acquired;
@@ -940,8 +931,7 @@ LockAcquireExtended(const LOCKTAG *locktag,
if (FastPathStrongRelationLocks->count[fasthashcode] != 0)
acquired = false;
else
- acquired = FastPathGrantRelationLock(locktag->locktag_field2,
- lockmode);
+ acquired = FastPathGrantRelationLock(locktag->locktag_field2, lockmode);
LWLockRelease(&MyProc->fpInfoLock);
if (acquired)
{
@@ -2076,8 +2066,7 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
locallock->lockCleared = false;
/* Attempt fast release of any lock eligible for the fast path. */
- if (EligibleForRelationFastPath(locktag, lockmode) &&
- FastPathLocalUseCount > 0)
+ if (EligibleForRelationFastPath(locktag, lockmode))
{
bool released;
@@ -2647,6 +2636,11 @@ LockReassignOwner(LOCALLOCK *locallock, ResourceOwner parent)
ResourceOwnerForgetLock(CurrentResourceOwner, locallock);
}
+//static int ItersGrant = 0;
+//static int CallsGrant = 0;
+//static int ItersUngrant = 0;
+//static int CallsUngrant = 0;
+
/*
* FastPathGrantRelationLock
* Grant lock using per-backend fast-path array, if there is space.
@@ -2654,20 +2648,31 @@ LockReassignOwner(LOCALLOCK *locallock, ResourceOwner parent)
static bool
FastPathGrantRelationLock(Oid relid, LOCKMODE lockmode)
{
- uint32 f;
+ uint32 f, i;
uint32 unused_slot = FP_LOCK_SLOTS_PER_BACKEND;
+ // if (CallsGrant % 100000 == 0 || CallsUngrant % 100000 == 0)
+ // elog(WARNING, "counts: %d, %d, %d, %d, %f, %f", CallsGrant, CallsUngrant, ItersGrant, ItersUngrant,
+ // (float)ItersGrant/(float)CallsGrant, (float)ItersUngrant/(float)CallsUngrant);
+
+ // CallsGrant++;
+
/* Scan for existing entry for this relid, remembering empty slot. */
- for (f = 0; f < FP_LOCK_SLOTS_PER_BACKEND; f++)
+ for (i = 0; i < FP_LOCK_SLOTS_PER_BACKEND; i++)
{
- if (FAST_PATH_GET_BITS(MyProc, f) == 0)
- unused_slot = f;
- else if (MyProc->fpRelId[f] == relid)
+ f = ((relid % FP_LOCK_SLOTS_PER_BACKEND) + i) % FP_LOCK_SLOTS_PER_BACKEND;
+ Assert(f < FP_LOCK_SLOTS_PER_BACKEND);
+ // ItersGrant++;
+
+ if (MyProc->fpRelId[f] == relid)
{
Assert(!FAST_PATH_CHECK_LOCKMODE(MyProc, f, lockmode));
FAST_PATH_SET_LOCKMODE(MyProc, f, lockmode);
return true;
}
+ else if (FAST_PATH_GET_BITS(MyProc, f) == 0)
+ if (unused_slot == FP_LOCK_SLOTS_PER_BACKEND)
+ unused_slot = f;
}
/* If no existing entry, use any empty slot. */
@@ -2675,7 +2680,6 @@ FastPathGrantRelationLock(Oid relid, LOCKMODE lockmode)
{
MyProc->fpRelId[unused_slot] = relid;
FAST_PATH_SET_LOCKMODE(MyProc, unused_slot, lockmode);
- ++FastPathLocalUseCount;
return true;
}
@@ -2691,24 +2695,25 @@ FastPathGrantRelationLock(Oid relid, LOCKMODE lockmode)
static bool
FastPathUnGrantRelationLock(Oid relid, LOCKMODE lockmode)
{
- uint32 f;
- bool result = false;
+ uint32 f, i;
- FastPathLocalUseCount = 0;
- for (f = 0; f < FP_LOCK_SLOTS_PER_BACKEND; f++)
+ // CallsUngrant++;
+
+ for (i = 0; i < FP_LOCK_SLOTS_PER_BACKEND; i++)
{
+ f = ((relid % FP_LOCK_SLOTS_PER_BACKEND) + i) % FP_LOCK_SLOTS_PER_BACKEND;
+ // ItersUngrant++;
+
if (MyProc->fpRelId[f] == relid
&& FAST_PATH_CHECK_LOCKMODE(MyProc, f, lockmode))
{
- Assert(!result);
FAST_PATH_CLEAR_LOCKMODE(MyProc, f, lockmode);
- result = true;
- /* we continue iterating so as to update FastPathLocalUseCount */
+ return true;
}
- if (FAST_PATH_GET_BITS(MyProc, f) != 0)
- ++FastPathLocalUseCount;
}
- return result;
+
+ return false;
}
/*
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index d35e5605de..dab1aaf195 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -97,6 +97,7 @@
#include <ctype.h>
#include <math.h>
+#include "access/nbtree.h"
#include "access/brin.h"
#include "access/brin_page.h"
#include "access/gin.h"
@@ -6831,6 +6832,13 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
* touched. The number of such pages is btree tree height plus one (ie,
* we charge for the leaf page too). As above, charge once per SA scan.
*/
+ if (index->tree_height < 0)
+ {
+ Relation indexRel = index_open(index->indexoid, AccessShareLock);
+ index->tree_height = _bt_getrootheight(indexRel);
+ index_close(indexRel, NoLock);
+ }
+
descentCost = (index->tree_height + 1) * 50.0 * cpu_operator_cost;
costs.indexStartupCost += descentCost;
costs.indexTotalCost += costs.num_sa_scans * descentCost;
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 2579e619eb..736b03d4f8 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -80,7 +80,7 @@ struct XidCache
* rather than the main lock table. This eases contention on the lock
* manager LWLocks. See storage/lmgr/README for additional details.
*/
-#define FP_LOCK_SLOTS_PER_BACKEND 16
+#define FP_LOCK_SLOTS_PER_BACKEND 64
/*
* An invalid pgprocno. Must be larger than the maximum number of PGPROC
@@ -282,7 +282,7 @@ struct PGPROC
/* Lock manager data, recording fast-path locks taken by this backend. */
LWLock fpInfoLock; /* protects per-backend fast-path state */
- uint64 fpLockBits; /* lock modes held for each fast-path slot */
+ uint8 fpLockBits[FP_LOCK_SLOTS_PER_BACKEND]; /* lock modes held for each fast-path slot */
Oid fpRelId[FP_LOCK_SLOTS_PER_BACKEND]; /* slots for rel oids */
bool fpVXIDLock; /* are we holding a fast-path VXID lock? */
LocalTransactionId fpLocalTransactionId; /* lxid for fast-path VXID
--
2.25.1