Re: Improve monitoring of shared memory allocations

Tomas Vondra Mon, 24 Mar 2025 07:41:27 -0700

Hi,

I did a review on v3 of the patch. I see there's been some minor changes
in v4 - I haven't tried to adjust my review, but I assume most of my
comments still apply.


Most of my suggestions are about formatting and/or readability. Some of
the likes (e.g. the pointer arithmetics) got so long pgindent would have
to wrap them, but more importantly really hard to decipher what it does.

I've added a couple "review" commits, actually doing most of what I'm
going to suggest.


1) I don't quite understand why hash_get_shared_size() got renamed to
hash_get_init_size()? Why is that needed? Does it cover only some
initial allocation, and there's additional memory allocated later? And
what's the point of the added const?


2) I find it more readable when ShmemInitStruct() is formatted on
multiple lines. Yes, it's a matter of choice, but also other places do
it this way, I think.


3) The changes in hash_create() are a good example of readability issues
I already mentioned. Consider this line:

curr_offset = (((char *) hashp->hctl) + sizeof(HASHHDR) + (hctl->dsize *
sizeof(HASHSEGMENT)) + (sizeof(HASHBUCKET) * hctl->ssize * nsegs));

First, I'd point out this is not an offset but a pointer, so the
variable name is a bit misleading. But more importantly, I envy those
who can parse this in their head and see if it's correct.

I think it's much better to define a couple macros to calculate parts of
this, essentially a tiny "language" expressing this in a concise way.
The 0002 patch defines

- HASH_ELEMENTS
- HASH_ELEMENT_NEXT
- HASH_SEGMENT_PTR
- HASH_SEGMENT_SIZE
- ...

and then uses that in hash_create(). I believe it's way more readable
like this.


4) I find it a bit strange when a function calculates multiple values,
and then returns them in different ways - one as a return value, one
through a pointer, the way compute_buckets_and_segs() did.

There are cases when it makes sense (e.g. when one of the values is
returned only optionally), but otherwise I think it's better to return
both values in the same way through a pointer. The 0002 patch adjusts
compute_buckets_and_segs() to it like this.

We have a precedent for this - ExecChooseHashTableSize(), which is doing
a very similar thing for sizing hashjoin hash table.


5) Isn't it wrong that PredicateLockShmemInit() removes the ShmemAlloc()
along with calculating the per-element requestSize, but then still does

    memset(PredXact->element, 0, requestSize);

and

    memset(RWConflictPool->element, 0, requestSize);

with requestSize for the whole allocation? I haven't seen any crashes,
but this seems wrong to me. I believe we can simply zero the whole
allocation right after ShmemInitStruct(), there's no need to do this for
each individual element.

5) InitProcGlobal is another example of hard-to-read code. Admittedly,
it wasn't particularly readable before, but making the lines even longer
does not make it much better ...

I guess adding a couple macros similar to hash_create() would be one way
to improve this (and there's even a review comment in that sense), but I
chose to just do maintain a simple pointer. But if you think it should
be done differently, feel free to do so.


6) I moved the PGProcShmemSize() to be right after ProcGlobalShmemSize()
which seems to be doing a very similar thing, and to not require the
prototype. Minor detail, feel free to undo.


7) I think the PG_CACHE_LINE_SIZE is entirely unrelated to the rest of
the patch, and I see no reason to do it in the same commit. So 0005
removes this change, and 0006 reintroduces it separately.

FWIW I see no justification for why the cache line padding is useful,
and it seems quite unlikely this would benefit anything, consiering it
adds padding between fairly long arrays. What kind of contention can we
get there? I don't get it.

Also, why is the patch adding padding after statusFlags (the last array
allocated in InitProcGlobal) and not between allProcs and xids?



regards

-- 
Tomas Vondra

From f527909dda02b4c7231db53a0fe6cecbaec55ca4 Mon Sep 17 00:00:00 2001
From: Rahila Syed <rahilasyed...@gmail.com>
Date: Thu, 6 Mar 2025 20:06:20 +0530
Subject: [PATCH v20250324 1/6] Account for initial shared memory allocated by
 hash_create

pg_shmem_allocations tracks the memory allocated by ShmemInitStruct,
which, in case of shared hash tables, only covers memory allocated
to the hash directory and control structure. The hash segments and
buckets are allocated using ShmemAllocNoError which does not attribute
the allocations to the hash table name. Thus, these allocations are
not tracked in pg_shmem_allocations.

Include the allocation of segments, buckets and elements in the initial
allocation of shared hash directory. This results in the existing ShmemIndex
entries to reflect all these allocations. The resulting tuples in
pg_shmem_allocations represent the total size of the initial hash table
including all the buckets and the elements they contain, instead of just
the directory size.
---
 src/backend/storage/ipc/shmem.c   |   3 +-
 src/backend/utils/hash/dynahash.c | 209 ++++++++++++++++++++++--------
 src/include/utils/hsearch.h       |   3 +-
 3 files changed, 161 insertions(+), 54 deletions(-)

diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 895a43fb39e..d8aed0bfaaf 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -73,6 +73,7 @@
 #include "storage/shmem.h"
 #include "storage/spin.h"
 #include "utils/builtins.h"
+#include "utils/dynahash.h"
 
 static void *ShmemAllocRaw(Size size, Size *allocated_size);
 
@@ -346,7 +347,7 @@ ShmemInitHash(const char *name,		/* table string name for shmem index */
 
 	/* look it up in the shmem index */
 	location = ShmemInitStruct(name,
-							   hash_get_shared_size(infoP, hash_flags),
+							   hash_get_init_size(infoP, hash_flags, init_size, 0),
 							   &found);
 
 	/*
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 3f25929f2d8..ba785f1d5f2 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -265,7 +265,7 @@ static long hash_accesses,
  */
 static void *DynaHashAlloc(Size size);
 static HASHSEGMENT seg_alloc(HTAB *hashp);
-static bool element_alloc(HTAB *hashp, int nelem, int freelist_idx);
+static HASHELEMENT *element_alloc(HTAB *hashp, int nelem);
 static bool dir_realloc(HTAB *hashp);
 static bool expand_table(HTAB *hashp);
 static HASHBUCKET get_hash_entry(HTAB *hashp, int freelist_idx);
@@ -281,6 +281,9 @@ static void register_seq_scan(HTAB *hashp);
 static void deregister_seq_scan(HTAB *hashp);
 static bool has_seq_scans(HTAB *hashp);
 
+static int	compute_buckets_and_segs(long nelem, int *nbuckets,
+									 long num_partitions, long ssize);
+static void element_add(HTAB *hashp, HASHELEMENT *firstElement, int freelist_idx, int nelem);
 
 /*
  * memory allocation support
@@ -353,6 +356,7 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 {
 	HTAB	   *hashp;
 	HASHHDR    *hctl;
+	int			nelem_batch;
 
 	/*
 	 * Hash tables now allocate space for key and data, but you have to say
@@ -507,9 +511,19 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		hashp->isshared = false;
 	}
 
+	/* Choose number of entries to allocate at a time */
+	nelem_batch = choose_nelem_alloc(info->entrysize);
+
+	/*
+	 * Allocate the memory needed for hash header, directory, segments and
+	 * elements together. Use pointer arithmetic to arrive at the start
+	 * of each of these structures later.
+	 */
 	if (!hashp->hctl)
 	{
-		hashp->hctl = (HASHHDR *) hashp->alloc(sizeof(HASHHDR));
+		Size		size = hash_get_init_size(info, flags, nelem, nelem_batch);
+
+		hashp->hctl = (HASHHDR *) hashp->alloc(size);
 		if (!hashp->hctl)
 			ereport(ERROR,
 					(errcode(ERRCODE_OUT_OF_MEMORY),
@@ -558,6 +572,8 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 	hctl->keysize = info->keysize;
 	hctl->entrysize = info->entrysize;
 
+	hctl->nelem_alloc = nelem_batch;
+
 	/* make local copies of heavily-used constant fields */
 	hashp->keysize = hctl->keysize;
 	hashp->ssize = hctl->ssize;
@@ -582,6 +598,9 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 					freelist_partitions,
 					nelem_alloc,
 					nelem_alloc_first;
+		void	   *curr_offset = NULL;
+		int			nsegs;
+		int			nbuckets;
 
 		/*
 		 * If hash table is partitioned, give each freelist an equal share of
@@ -592,6 +611,15 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		else
 			freelist_partitions = 1;
 
+		/*
+		 * If table is shared, calculate the offset at which to find the the
+		 * first partition of elements
+		 */
+
+		nsegs = compute_buckets_and_segs(nelem, &nbuckets, hctl->num_partitions, hctl->ssize);
+
+		curr_offset = (((char *) hashp->hctl) + sizeof(HASHHDR) + (hctl->dsize * sizeof(HASHSEGMENT)) + (sizeof(HASHBUCKET) * hctl->ssize * nsegs));
+
 		nelem_alloc = nelem / freelist_partitions;
 		if (nelem_alloc <= 0)
 			nelem_alloc = 1;
@@ -609,11 +637,16 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		for (i = 0; i < freelist_partitions; i++)
 		{
 			int			temp = (i == 0) ? nelem_alloc_first : nelem_alloc;
+			HASHELEMENT *firstElement;
+			Size		elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
 
-			if (!element_alloc(hashp, temp, i))
-				ereport(ERROR,
-						(errcode(ERRCODE_OUT_OF_MEMORY),
-						 errmsg("out of memory")));
+			/*
+			 * Memory is allocated as part of initial allocation in
+			 * ShmemInitHash
+			 */
+			firstElement = (HASHELEMENT *) curr_offset;
+			curr_offset = (((char *) curr_offset) + (temp * elementSize));
+			element_add(hashp, firstElement, i, temp);
 		}
 	}
 
@@ -701,30 +734,11 @@ init_htab(HTAB *hashp, long nelem)
 		for (i = 0; i < NUM_FREELISTS; i++)
 			SpinLockInit(&(hctl->freeList[i].mutex));
 
-	/*
-	 * Allocate space for the next greater power of two number of buckets,
-	 * assuming a desired maximum load factor of 1.
-	 */
-	nbuckets = next_pow2_int(nelem);
-
-	/*
-	 * In a partitioned table, nbuckets must be at least equal to
-	 * num_partitions; were it less, keys with apparently different partition
-	 * numbers would map to the same bucket, breaking partition independence.
-	 * (Normally nbuckets will be much bigger; this is just a safety check.)
-	 */
-	while (nbuckets < hctl->num_partitions)
-		nbuckets <<= 1;
+	nsegs = compute_buckets_and_segs(nelem, &nbuckets, hctl->num_partitions, hctl->ssize);
 
 	hctl->max_bucket = hctl->low_mask = nbuckets - 1;
 	hctl->high_mask = (nbuckets << 1) - 1;
 
-	/*
-	 * Figure number of directory segments needed, round up to a power of 2
-	 */
-	nsegs = (nbuckets - 1) / hctl->ssize + 1;
-	nsegs = next_pow2_int(nsegs);
-
 	/*
 	 * Make sure directory is big enough. If pre-allocated directory is too
 	 * small, choke (caller screwed up).
@@ -741,22 +755,21 @@ init_htab(HTAB *hashp, long nelem)
 	if (!(hashp->dir))
 	{
 		CurrentDynaHashCxt = hashp->hcxt;
-		hashp->dir = (HASHSEGMENT *)
-			hashp->alloc(hctl->dsize * sizeof(HASHSEGMENT));
-		if (!hashp->dir)
-			return false;
+		hashp->dir = (HASHSEGMENT *) (((char *) hashp->hctl) + sizeof(HASHHDR));
 	}
 
 	/* Allocate initial segments */
+	i = 0;
 	for (segp = hashp->dir; hctl->nsegs < nsegs; hctl->nsegs++, segp++)
 	{
-		*segp = seg_alloc(hashp);
-		if (*segp == NULL)
-			return false;
-	}
+		*segp = (HASHBUCKET *) (((char *) hashp->hctl)
+								+ sizeof(HASHHDR)
+								+ (hashp->hctl->dsize * sizeof(HASHSEGMENT))
+								+ (i * sizeof(HASHBUCKET) * hashp->ssize));
+		MemSet(*segp, 0, sizeof(HASHBUCKET) * hashp->ssize);
 
-	/* Choose number of entries to allocate at a time */
-	hctl->nelem_alloc = choose_nelem_alloc(hctl->entrysize);
+		i = i + 1;
+	}
 
 #ifdef HASH_DEBUG
 	fprintf(stderr, "init_htab:\n%s%p\n%s%ld\n%s%ld\n%s%d\n%s%ld\n%s%u\n%s%x\n%s%x\n%s%ld\n",
@@ -851,11 +864,64 @@ hash_select_dirsize(long num_entries)
  * and for the (non expansible) directory.
  */
 Size
-hash_get_shared_size(HASHCTL *info, int flags)
+hash_get_init_size(const HASHCTL *info, int flags, long init_size, int nelem_alloc)
 {
-	Assert(flags & HASH_DIRSIZE);
-	Assert(info->dsize == info->max_dsize);
-	return sizeof(HASHHDR) + info->dsize * sizeof(HASHSEGMENT);
+	int			nbuckets;
+	int			nsegs;
+	int			num_partitions;
+	long		ssize;
+	long		dsize;
+	bool		element_alloc = true;
+	Size		elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(info->entrysize);
+
+	/*
+	 * For non-shared hash tables, the requested number of elements
+	 * are allocated only if they are less than nelem_alloc.
+	 * In any case, the init_size should be equal to the number of
+	 * elements added using element_add() in hash_create.
+	 */
+	if (!(flags & HASH_SHARED_MEM))
+	{
+		if (init_size > nelem_alloc)
+			element_alloc = false;
+	}
+	else
+	{
+		Assert(flags & HASH_DIRSIZE);
+		Assert(info->dsize == info->max_dsize);
+	}
+	/* Non-shared hash tables may not specify dir size */
+	if (!(flags & HASH_DIRSIZE))
+	{
+		dsize = DEF_DIRSIZE;
+	}
+	else
+		dsize = info->dsize;
+
+	if (flags & HASH_PARTITION)
+	{
+		num_partitions = info->num_partitions;
+
+		/* Number of entries should be atleast equal to the freelists */
+		if (init_size < NUM_FREELISTS)
+			init_size = NUM_FREELISTS;
+	}
+	else
+		num_partitions = 0;
+
+	if (flags & HASH_SEGMENT)
+		ssize = info->ssize;
+	else
+		ssize = DEF_SEGSIZE;
+
+	nsegs = compute_buckets_and_segs(init_size, &nbuckets, num_partitions, ssize);
+
+	if (!element_alloc)
+		init_size = 0;
+
+	return sizeof(HASHHDR) + dsize * sizeof(HASHSEGMENT) +
+		+sizeof(HASHBUCKET) * ssize * nsegs
+		+ init_size * elementSize;
 }
 
 
@@ -1285,7 +1351,8 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 		 * Failing because the needed element is in a different freelist is
 		 * not acceptable.
 		 */
-		if (!element_alloc(hashp, hctl->nelem_alloc, freelist_idx))
+		newElement = element_alloc(hashp, hctl->nelem_alloc);
+		if (newElement == NULL)
 		{
 			int			borrow_from_idx;
 
@@ -1322,6 +1389,7 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 			/* no elements available to borrow either, so out of memory */
 			return NULL;
 		}
+		element_add(hashp, newElement, freelist_idx, hctl->nelem_alloc);
 	}
 
 	/* remove entry from freelist, bump nentries */
@@ -1702,28 +1770,38 @@ seg_alloc(HTAB *hashp)
 /*
  * allocate some new elements and link them into the indicated free list
  */
-static bool
-element_alloc(HTAB *hashp, int nelem, int freelist_idx)
+static HASHELEMENT *
+element_alloc(HTAB *hashp, int nelem)
 {
 	HASHHDR    *hctl = hashp->hctl;
 	Size		elementSize;
-	HASHELEMENT *firstElement;
-	HASHELEMENT *tmpElement;
-	HASHELEMENT *prevElement;
-	int			i;
+	HASHELEMENT *firstElement = NULL;
 
 	if (hashp->isfixed)
-		return false;
+		return NULL;
 
 	/* Each element has a HASHELEMENT header plus user data. */
 	elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
-
 	CurrentDynaHashCxt = hashp->hcxt;
 	firstElement = (HASHELEMENT *) hashp->alloc(nelem * elementSize);
 
 	if (!firstElement)
-		return false;
+		return NULL;
+
+	return firstElement;
+}
 
+static void
+element_add(HTAB *hashp, HASHELEMENT *firstElement, int freelist_idx, int nelem)
+{
+	HASHHDR    *hctl = hashp->hctl;
+	Size		elementSize;
+	HASHELEMENT *tmpElement;
+	HASHELEMENT *prevElement;
+	int			i;
+
+	/* Each element has a HASHELEMENT header plus user data. */
+	elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
 	/* prepare to link all the new entries into the freelist */
 	prevElement = NULL;
 	tmpElement = firstElement;
@@ -1744,8 +1822,6 @@ element_alloc(HTAB *hashp, int nelem, int freelist_idx)
 
 	if (IS_PARTITIONED(hctl))
 		SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
-
-	return true;
 }
 
 /*
@@ -1957,3 +2033,32 @@ AtEOSubXact_HashTables(bool isCommit, int nestDepth)
 		}
 	}
 }
+
+static int
+compute_buckets_and_segs(long nelem, int *nbuckets, long num_partitions, long ssize)
+{
+	int			nsegs;
+
+	/*
+	 * Allocate space for the next greater power of two number of buckets,
+	 * assuming a desired maximum load factor of 1.
+	 */
+	*nbuckets = next_pow2_int(nelem);
+
+	/*
+	 * In a partitioned table, nbuckets must be at least equal to
+	 * num_partitions; were it less, keys with apparently different partition
+	 * numbers would map to the same bucket, breaking partition independence.
+	 * (Normally nbuckets will be much bigger; this is just a safety check.)
+	 */
+	while ((*nbuckets) < num_partitions)
+		(*nbuckets) <<= 1;
+
+
+	/*
+	 * Figure number of directory segments needed, round up to a power of 2
+	 */
+	nsegs = ((*nbuckets) - 1) / ssize + 1;
+	nsegs = next_pow2_int(nsegs);
+	return nsegs;
+}
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index 932cc4f34d9..5e513c8116a 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -151,7 +151,8 @@ extern void hash_seq_term(HASH_SEQ_STATUS *status);
 extern void hash_freeze(HTAB *hashp);
 extern Size hash_estimate_size(long num_entries, Size entrysize);
 extern long hash_select_dirsize(long num_entries);
-extern Size hash_get_shared_size(HASHCTL *info, int flags);
+extern Size hash_get_init_size(const HASHCTL *info, int flags,
+					long init_size, int nelem_alloc);
 extern void AtEOXact_HashTables(bool isCommit);
 extern void AtEOSubXact_HashTables(bool isCommit, int nestDepth);
 
-- 
2.49.0

From d737c4033a92f070c3c3ad791fdef2c8e133394f Mon Sep 17 00:00:00 2001
From: Tomas Vondra <to...@vondra.me>
Date: Sat, 22 Mar 2025 12:20:44 +0100
Subject: [PATCH v20250324 2/6] review

---
 src/backend/storage/ipc/shmem.c   |   4 +-
 src/backend/utils/hash/dynahash.c | 112 ++++++++++++++++++++----------
 src/include/utils/hsearch.h       |   4 +-
 3 files changed, 81 insertions(+), 39 deletions(-)

diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index d8aed0bfaaf..51ad68cc796 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -345,9 +345,11 @@ ShmemInitHash(const char *name,		/* table string name for shmem index */
 	infoP->alloc = ShmemAllocNoError;
 	hash_flags |= HASH_SHARED_MEM | HASH_ALLOC | HASH_DIRSIZE;
 
+	/* review: I don't see a reason to rename this to _init_ */
+
 	/* look it up in the shmem index */
 	location = ShmemInitStruct(name,
-							   hash_get_init_size(infoP, hash_flags, init_size, 0),
+							   hash_get_shared_size(infoP, hash_flags, init_size, 0),
 							   &found);
 
 	/*
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index ba785f1d5f2..9792377d567 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -260,6 +260,30 @@ static long hash_accesses,
 			hash_expansions;
 #endif
 
+/* review: shouldn't this have some MAXALIGNs? */
+#define HASH_ELEMENTS_OFFSET(hctl, nsegs) \
+	(sizeof(HASHHDR) + \
+	 ((hctl)->dsize * sizeof(HASHSEGMENT)) + \
+	 ((hctl)->ssize * (nsegs) * sizeof(HASHBUCKET)))
+
+#define HASH_ELEMENTS(hashp, nsegs) \
+	((char *) (hashp)->hctl + HASH_ELEMENTS_OFFSET((hashp)->hctl, nsegs))
+
+#define HASH_SEGMENT_OFFSET(hctl, idx) \
+	(sizeof(HASHHDR) + \
+	 ((hctl)->dsize * sizeof(HASHSEGMENT)) + \
+	 ((hctl)->ssize * (idx) * sizeof(HASHBUCKET)))
+
+#define HASH_SEGMENT_PTR(hashp, idx) \
+	(HASHSEGMENT) ((char *) (hashp)->hctl + HASH_SEGMENT_OFFSET((hashp)->hctl, (idx)))
+
+#define HASH_SEGMENT_SIZE(hashp)	((hashp)->ssize * sizeof(HASHBUCKET))
+
+#define	HASH_DIRECTORY(hashp)	(HASHSEGMENT *) (((char *) (hashp)->hctl) + sizeof(HASHHDR))
+
+#define HASH_ELEMENT_NEXT(hctl, num, ptr) \
+	((char *) (ptr) + ((num) * (MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN((hctl)->entrysize))))
+
 /*
  * Private function prototypes
  */
@@ -281,9 +305,16 @@ static void register_seq_scan(HTAB *hashp);
 static void deregister_seq_scan(HTAB *hashp);
 static bool has_seq_scans(HTAB *hashp);
 
-static int	compute_buckets_and_segs(long nelem, int *nbuckets,
-									 long num_partitions, long ssize);
-static void element_add(HTAB *hashp, HASHELEMENT *firstElement, int freelist_idx, int nelem);
+/*
+ * review: it's a bit weird to have a function that calculates two values,
+ * but returns one through a pointer and another as a regular retval. see
+ * ExecChooseHashTableSize for a better approach.
+ */
+static void	compute_buckets_and_segs(long nelem, long num_partitions,
+									 long ssize, /* segment size */
+									 int *nbuckets, int *nsegments);
+static void element_add(HTAB *hashp, HASHELEMENT *firstElement,
+						int freelist_idx, int nelem);
 
 /*
  * memory allocation support
@@ -511,7 +542,7 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		hashp->isshared = false;
 	}
 
-	/* Choose number of entries to allocate at a time */
+	/* Choose the number of entries to allocate at a time. */
 	nelem_batch = choose_nelem_alloc(info->entrysize);
 
 	/*
@@ -521,7 +552,7 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 	 */
 	if (!hashp->hctl)
 	{
-		Size		size = hash_get_init_size(info, flags, nelem, nelem_batch);
+		Size	size = hash_get_shared_size(info, flags, nelem, nelem_batch);
 
 		hashp->hctl = (HASHHDR *) hashp->alloc(size);
 		if (!hashp->hctl)
@@ -572,6 +603,7 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 	hctl->keysize = info->keysize;
 	hctl->entrysize = info->entrysize;
 
+	/* remember how many elements to allocate at once */
 	hctl->nelem_alloc = nelem_batch;
 
 	/* make local copies of heavily-used constant fields */
@@ -598,7 +630,7 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 					freelist_partitions,
 					nelem_alloc,
 					nelem_alloc_first;
-		void	   *curr_offset = NULL;
+		void	   *ptr = NULL;		/* XXX it's not offset, clearly */
 		int			nsegs;
 		int			nbuckets;
 
@@ -612,13 +644,18 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 			freelist_partitions = 1;
 
 		/*
-		 * If table is shared, calculate the offset at which to find the the
-		 * first partition of elements
+		 * For shared hash tables, we need to determine the offset for the
+		 * first partition of elements. We have to skip space for the header,
+		 * segments and buckets.
+		 *
+	 	 * XXX can we rely on this matching the calculation in hash_get_shared_size?
+		 * or could/should we add some asserts? Can we have at least some sanity checks
+		 * on nbuckets/nsegs?
 		 */
+		compute_buckets_and_segs(nelem, hctl->num_partitions, hctl->ssize,
+								 &nbuckets, &nsegs);
 
-		nsegs = compute_buckets_and_segs(nelem, &nbuckets, hctl->num_partitions, hctl->ssize);
-
-		curr_offset = (((char *) hashp->hctl) + sizeof(HASHHDR) + (hctl->dsize * sizeof(HASHSEGMENT)) + (sizeof(HASHBUCKET) * hctl->ssize * nsegs));
+		ptr = HASH_ELEMENTS(hashp, nsegs);
 
 		nelem_alloc = nelem / freelist_partitions;
 		if (nelem_alloc <= 0)
@@ -637,16 +674,15 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		for (i = 0; i < freelist_partitions; i++)
 		{
 			int			temp = (i == 0) ? nelem_alloc_first : nelem_alloc;
-			HASHELEMENT *firstElement;
-			Size		elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
 
 			/*
-			 * Memory is allocated as part of initial allocation in
-			 * ShmemInitHash
+			 * Memory is allocated as part of allocation in ShmemInitHash, we
+			 * just need to split that allocation per-batch freelists.
+			 *
+			 * review: why do we invert the argument order?
 			 */
-			firstElement = (HASHELEMENT *) curr_offset;
-			curr_offset = (((char *) curr_offset) + (temp * elementSize));
-			element_add(hashp, firstElement, i, temp);
+			element_add(hashp, (HASHELEMENT *) ptr, i, temp);
+			ptr = HASH_ELEMENT_NEXT(hctl, temp, ptr);
 		}
 	}
 
@@ -734,7 +770,8 @@ init_htab(HTAB *hashp, long nelem)
 		for (i = 0; i < NUM_FREELISTS; i++)
 			SpinLockInit(&(hctl->freeList[i].mutex));
 
-	nsegs = compute_buckets_and_segs(nelem, &nbuckets, hctl->num_partitions, hctl->ssize);
+	compute_buckets_and_segs(nelem, hctl->num_partitions, hctl->ssize,
+							 &nbuckets, &nsegs);
 
 	hctl->max_bucket = hctl->low_mask = nbuckets - 1;
 	hctl->high_mask = (nbuckets << 1) - 1;
@@ -755,22 +792,19 @@ init_htab(HTAB *hashp, long nelem)
 	if (!(hashp->dir))
 	{
 		CurrentDynaHashCxt = hashp->hcxt;
-		hashp->dir = (HASHSEGMENT *) (((char *) hashp->hctl) + sizeof(HASHHDR));
+		hashp->dir = HASH_DIRECTORY(hashp);
 	}
 
 	/* Allocate initial segments */
 	i = 0;
 	for (segp = hashp->dir; hctl->nsegs < nsegs; hctl->nsegs++, segp++)
 	{
-		*segp = (HASHBUCKET *) (((char *) hashp->hctl)
-								+ sizeof(HASHHDR)
-								+ (hashp->hctl->dsize * sizeof(HASHSEGMENT))
-								+ (i * sizeof(HASHBUCKET) * hashp->ssize));
-		MemSet(*segp, 0, sizeof(HASHBUCKET) * hashp->ssize);
-
-		i = i + 1;
+		*segp = HASH_SEGMENT_PTR(hashp, i++);
+		MemSet(*segp, 0, HASH_SEGMENT_SIZE(hashp));
 	}
 
+	Assert(i == nsegs);
+
 #ifdef HASH_DEBUG
 	fprintf(stderr, "init_htab:\n%s%p\n%s%ld\n%s%ld\n%s%d\n%s%ld\n%s%u\n%s%x\n%s%x\n%s%ld\n",
 			"TABLE POINTER   ", hashp,
@@ -862,9 +896,14 @@ hash_select_dirsize(long num_entries)
  * Compute the required initial memory allocation for a shared-memory
  * hashtable with the given parameters.  We need space for the HASHHDR
  * and for the (non expansible) directory.
+ *
+ * review: why rename this to "init"
+ * review: also, why adding the "const"?
+ * review: the comment should really explain the arguments, e.g. what
+ * is nelem_alloc for is not obvious?
  */
 Size
-hash_get_init_size(const HASHCTL *info, int flags, long init_size, int nelem_alloc)
+hash_get_shared_size(const HASHCTL *info, int flags, long init_size, int nelem_alloc)
 {
 	int			nbuckets;
 	int			nsegs;
@@ -914,7 +953,8 @@ hash_get_init_size(const HASHCTL *info, int flags, long init_size, int nelem_all
 	else
 		ssize = DEF_SEGSIZE;
 
-	nsegs = compute_buckets_and_segs(init_size, &nbuckets, num_partitions, ssize);
+	compute_buckets_and_segs(init_size, num_partitions, ssize,
+							 &nbuckets, &nsegs);
 
 	if (!element_alloc)
 		init_size = 0;
@@ -1791,6 +1831,7 @@ element_alloc(HTAB *hashp, int nelem)
 	return firstElement;
 }
 
+/* review: comment needed, how is the work divided between element_alloc and this? */
 static void
 element_add(HTAB *hashp, HASHELEMENT *firstElement, int freelist_idx, int nelem)
 {
@@ -2034,11 +2075,11 @@ AtEOSubXact_HashTables(bool isCommit, int nestDepth)
 	}
 }
 
-static int
-compute_buckets_and_segs(long nelem, int *nbuckets, long num_partitions, long ssize)
+/* review: comment needed */
+static void
+compute_buckets_and_segs(long nelem, long num_partitions, long ssize,
+						 int *nbuckets, int *nsegments)
 {
-	int			nsegs;
-
 	/*
 	 * Allocate space for the next greater power of two number of buckets,
 	 * assuming a desired maximum load factor of 1.
@@ -2058,7 +2099,6 @@ compute_buckets_and_segs(long nelem, int *nbuckets, long num_partitions, long ss
 	/*
 	 * Figure number of directory segments needed, round up to a power of 2
 	 */
-	nsegs = ((*nbuckets) - 1) / ssize + 1;
-	nsegs = next_pow2_int(nsegs);
-	return nsegs;
+	*nsegments = ((*nbuckets) - 1) / ssize + 1;
+	*nsegments = next_pow2_int(*nsegments);
 }
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index 5e513c8116a..e78ab38b4e8 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -151,8 +151,8 @@ extern void hash_seq_term(HASH_SEQ_STATUS *status);
 extern void hash_freeze(HTAB *hashp);
 extern Size hash_estimate_size(long num_entries, Size entrysize);
 extern long hash_select_dirsize(long num_entries);
-extern Size hash_get_init_size(const HASHCTL *info, int flags,
-					long init_size, int nelem_alloc);
+extern Size hash_get_shared_size(const HASHCTL *info, int flags,
+								 long init_size, int nelem_alloc);
 extern void AtEOXact_HashTables(bool isCommit);
 extern void AtEOSubXact_HashTables(bool isCommit, int nestDepth);
 
-- 
2.49.0

From bd743402321ae01e27348250865f95290d37b842 Mon Sep 17 00:00:00 2001
From: Rahila Syed <rahilasyed...@gmail.com>
Date: Thu, 6 Mar 2025 20:32:27 +0530
Subject: [PATCH v20250324 3/6] Replace ShmemAlloc calls by ShmemInitStruct

The shared memory allocated by ShmemAlloc is not tracked
by pg_shmem_allocations. This commit replaces most of the
calls to ShmemAlloc by ShmemInitStruct to associate a name
with the allocations and ensure that they get tracked by
pg_shmem_allocations. It also merges several smaller
ShmemAlloc calls into larger ShmemInitStruct to allocate
and track all the related memory allocations under single
---
 src/backend/storage/lmgr/predicate.c | 17 +++++++-------
 src/backend/storage/lmgr/proc.c      | 33 +++++++++++++++++++++++-----
 2 files changed, 36 insertions(+), 14 deletions(-)

diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 5b21a053981..dd66990335b 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -1226,8 +1226,11 @@ PredicateLockShmemInit(void)
 	 */
 	max_table_size *= 10;
 
+	requestSize = add_size(PredXactListDataSize,
+						   (mul_size((Size) max_table_size,
+									 sizeof(SERIALIZABLEXACT))));
 	PredXact = ShmemInitStruct("PredXactList",
-							   PredXactListDataSize,
+							   requestSize,
 							   &found);
 	Assert(found == IsUnderPostmaster);
 	if (!found)
@@ -1242,9 +1245,7 @@ PredicateLockShmemInit(void)
 		PredXact->LastSxactCommitSeqNo = FirstNormalSerCommitSeqNo - 1;
 		PredXact->CanPartialClearThrough = 0;
 		PredXact->HavePartialClearedThrough = 0;
-		requestSize = mul_size((Size) max_table_size,
-							   sizeof(SERIALIZABLEXACT));
-		PredXact->element = ShmemAlloc(requestSize);
+		PredXact->element = (SERIALIZABLEXACT *) ((char *) PredXact + PredXactListDataSize);
 		/* Add all elements to available list, clean. */
 		memset(PredXact->element, 0, requestSize);
 		for (i = 0; i < max_table_size; i++)
@@ -1299,9 +1300,11 @@ PredicateLockShmemInit(void)
 	 * probably OK.
 	 */
 	max_table_size *= 5;
+	requestSize = mul_size((Size) max_table_size,
+						   RWConflictDataSize);
 
 	RWConflictPool = ShmemInitStruct("RWConflictPool",
-									 RWConflictPoolHeaderDataSize,
+									 RWConflictPoolHeaderDataSize + requestSize,
 									 &found);
 	Assert(found == IsUnderPostmaster);
 	if (!found)
@@ -1309,9 +1312,7 @@ PredicateLockShmemInit(void)
 		int			i;
 
 		dlist_init(&RWConflictPool->availableList);
-		requestSize = mul_size((Size) max_table_size,
-							   RWConflictDataSize);
-		RWConflictPool->element = ShmemAlloc(requestSize);
+		RWConflictPool->element = (RWConflict) ((char *) RWConflictPool + RWConflictPoolHeaderDataSize);
 		/* Add all elements to available list, clean. */
 		memset(RWConflictPool->element, 0, requestSize);
 		for (i = 0; i < max_table_size; i++)
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index e4ca861a8e6..65239d743da 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -88,6 +88,7 @@ static void RemoveProcFromArray(int code, Datum arg);
 static void ProcKill(int code, Datum arg);
 static void AuxiliaryProcKill(int code, Datum arg);
 static void CheckDeadLock(void);
+static Size PGProcShmemSize(void);
 
 
 /*
@@ -175,6 +176,7 @@ InitProcGlobal(void)
 			   *fpEndPtr PG_USED_FOR_ASSERTS_ONLY;
 	Size		fpLockBitsSize,
 				fpRelIdSize;
+	Size		requestSize;
 
 	/* Create the ProcGlobal shared structure */
 	ProcGlobal = (PROC_HDR *)
@@ -204,7 +206,10 @@ InitProcGlobal(void)
 	 * with a single freelist.)  Each PGPROC structure is dedicated to exactly
 	 * one of these purposes, and they do not move between groups.
 	 */
-	procs = (PGPROC *) ShmemAlloc(TotalProcs * sizeof(PGPROC));
+	requestSize = PGProcShmemSize();
+
+	procs = (PGPROC *) ShmemInitStruct("PGPROC structures", requestSize, &found);
+
 	MemSet(procs, 0, TotalProcs * sizeof(PGPROC));
 	ProcGlobal->allProcs = procs;
 	/* XXX allProcCount isn't really all of them; it excludes prepared xacts */
@@ -218,11 +223,11 @@ InitProcGlobal(void)
 	 * how hotly they are accessed.
 	 */
 	ProcGlobal->xids =
-		(TransactionId *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->xids));
+		(TransactionId *) ((char *) procs + TotalProcs * sizeof(PGPROC));
 	MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
-	ProcGlobal->subxidStates = (XidCacheStatus *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	ProcGlobal->subxidStates = (XidCacheStatus *) ((char *) ProcGlobal->xids + TotalProcs * sizeof(*ProcGlobal->xids) + PG_CACHE_LINE_SIZE);
 	MemSet(ProcGlobal->subxidStates, 0, TotalProcs * sizeof(*ProcGlobal->subxidStates));
-	ProcGlobal->statusFlags = (uint8 *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->statusFlags));
+	ProcGlobal->statusFlags = (uint8 *) ((char *) ProcGlobal->subxidStates + TotalProcs * sizeof(*ProcGlobal->subxidStates) + PG_CACHE_LINE_SIZE);
 	MemSet(ProcGlobal->statusFlags, 0, TotalProcs * sizeof(*ProcGlobal->statusFlags));
 
 	/*
@@ -233,7 +238,7 @@ InitProcGlobal(void)
 	fpLockBitsSize = MAXALIGN(FastPathLockGroupsPerBackend * sizeof(uint64));
 	fpRelIdSize = MAXALIGN(FastPathLockSlotsPerBackend() * sizeof(Oid));
 
-	fpPtr = ShmemAlloc(TotalProcs * (fpLockBitsSize + fpRelIdSize));
+	fpPtr = ShmemInitStruct("Fast path lock arrays", TotalProcs * (fpLockBitsSize + fpRelIdSize), &found);
 	MemSet(fpPtr, 0, TotalProcs * (fpLockBitsSize + fpRelIdSize));
 
 	/* For asserts checking we did not overflow. */
@@ -330,10 +335,26 @@ InitProcGlobal(void)
 	PreparedXactProcs = &procs[MaxBackends + NUM_AUXILIARY_PROCS];
 
 	/* Create ProcStructLock spinlock, too */
-	ProcStructLock = (slock_t *) ShmemAlloc(sizeof(slock_t));
+	ProcStructLock = (slock_t *) ShmemInitStruct("ProcStructLock spinlock", sizeof(slock_t), &found);
 	SpinLockInit(ProcStructLock);
 }
 
+static Size
+PGProcShmemSize(void)
+{
+	Size		size;
+	uint32		TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + max_prepared_xacts;
+
+	size = TotalProcs * sizeof(PGPROC);
+	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->xids));
+	size = add_size(size, PG_CACHE_LINE_SIZE);
+	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	size = add_size(size, PG_CACHE_LINE_SIZE);
+	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->statusFlags));
+	size = add_size(size, PG_CACHE_LINE_SIZE);
+	return size;
+}
+
 /*
  * InitProcess -- initialize a per-process PGPROC entry for this backend
  */
-- 
2.49.0

From eac38c347318a0ee074b25d487de66575a242e12 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <to...@vondra.me>
Date: Sat, 22 Mar 2025 12:50:38 +0100
Subject: [PATCH v20250324 4/6] review

---
 src/backend/storage/lmgr/predicate.c | 21 +++++---
 src/backend/storage/lmgr/proc.c      | 75 +++++++++++++++++++---------
 2 files changed, 66 insertions(+), 30 deletions(-)

diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index dd66990335b..aeba84de786 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -1237,6 +1237,9 @@ PredicateLockShmemInit(void)
 	{
 		int			i;
 
+		/* reset everything, both the header and the element */
+		memset(PredXact, 0, requestSize);
+
 		dlist_init(&PredXact->availableList);
 		dlist_init(&PredXact->activeList);
 		PredXact->SxactGlobalXmin = InvalidTransactionId;
@@ -1247,7 +1250,9 @@ PredicateLockShmemInit(void)
 		PredXact->HavePartialClearedThrough = 0;
 		PredXact->element = (SERIALIZABLEXACT *) ((char *) PredXact + PredXactListDataSize);
 		/* Add all elements to available list, clean. */
-		memset(PredXact->element, 0, requestSize);
+		/* XXX wasn't this actually wrong, considering requestSize is the whole
+		 * shmem allocation, including PredXactListDataSize? */
+		// memset(PredXact->element, 0, requestSize);
 		for (i = 0; i < max_table_size; i++)
 		{
 			LWLockInitialize(&PredXact->element[i].perXactPredicateListLock,
@@ -1300,21 +1305,25 @@ PredicateLockShmemInit(void)
 	 * probably OK.
 	 */
 	max_table_size *= 5;
-	requestSize = mul_size((Size) max_table_size,
-						   RWConflictDataSize);
+	requestSize = RWConflictPoolHeaderDataSize +
+					mul_size((Size) max_table_size,
+							 RWConflictDataSize);
 
 	RWConflictPool = ShmemInitStruct("RWConflictPool",
-									 RWConflictPoolHeaderDataSize + requestSize,
+									 requestSize,
 									 &found);
 	Assert(found == IsUnderPostmaster);
 	if (!found)
 	{
 		int			i;
 
+		/* clean everything, including the elements */
+		memset(RWConflictPool, 0, requestSize);
+
 		dlist_init(&RWConflictPool->availableList);
-		RWConflictPool->element = (RWConflict) ((char *) RWConflictPool + RWConflictPoolHeaderDataSize);
+		RWConflictPool->element = (RWConflict) ((char *) RWConflictPool +
+			RWConflictPoolHeaderDataSize);
 		/* Add all elements to available list, clean. */
-		memset(RWConflictPool->element, 0, requestSize);
 		for (i = 0; i < max_table_size; i++)
 		{
 			dlist_push_tail(&RWConflictPool->availableList,
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 65239d743da..4afd7cd42c3 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -88,7 +88,6 @@ static void RemoveProcFromArray(int code, Datum arg);
 static void ProcKill(int code, Datum arg);
 static void AuxiliaryProcKill(int code, Datum arg);
 static void CheckDeadLock(void);
-static Size PGProcShmemSize(void);
 
 
 /*
@@ -124,6 +123,27 @@ ProcGlobalShmemSize(void)
 	return size;
 }
 
+/*
+ * review: add comment, explaining the PG_CACHE_LINE_SIZE thing
+ * review: I'd even maybe split the PG_CACHE_LINE_SIZE thing into
+ * a separate commit, not to mix it with the "monitoring improvement"
+ */
+static Size
+PGProcShmemSize(void)
+{
+	Size		size;
+	uint32		TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + max_prepared_xacts;
+
+	size = TotalProcs * sizeof(PGPROC);
+	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->xids));
+	size = add_size(size, PG_CACHE_LINE_SIZE);
+	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	size = add_size(size, PG_CACHE_LINE_SIZE);
+	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->statusFlags));
+	size = add_size(size, PG_CACHE_LINE_SIZE);
+	return size;
+}
+
 /*
  * Report number of semaphores needed by InitProcGlobal.
  */
@@ -177,6 +197,7 @@ InitProcGlobal(void)
 	Size		fpLockBitsSize,
 				fpRelIdSize;
 	Size		requestSize;
+	char	   *ptr;
 
 	/* Create the ProcGlobal shared structure */
 	ProcGlobal = (PROC_HDR *)
@@ -208,7 +229,12 @@ InitProcGlobal(void)
 	 */
 	requestSize = PGProcShmemSize();
 
-	procs = (PGPROC *) ShmemInitStruct("PGPROC structures", requestSize, &found);
+	ptr = ShmemInitStruct("PGPROC structures",
+									   requestSize,
+									   &found);
+
+	procs = (PGPROC *) ptr;
+	ptr += TotalProcs * sizeof(PGPROC);
 
 	MemSet(procs, 0, TotalProcs * sizeof(PGPROC));
 	ProcGlobal->allProcs = procs;
@@ -221,14 +247,27 @@ InitProcGlobal(void)
 	 *
 	 * XXX: It might make sense to increase padding for these arrays, given
 	 * how hotly they are accessed.
+	 *
+	 * review: does the padding comment still make sense with PG_CACHE_LINE_SIZE?
+	 *         presumably that's the padding mentioned by the comment?
+	 *
+	 * review: those lines are too long / not comprehensible, let's define some
+	 *         macros to calculate stuff?
 	 */
-	ProcGlobal->xids =
-		(TransactionId *) ((char *) procs + TotalProcs * sizeof(PGPROC));
+	ProcGlobal->xids = (TransactionId *) ptr;
 	MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
-	ProcGlobal->subxidStates = (XidCacheStatus *) ((char *) ProcGlobal->xids + TotalProcs * sizeof(*ProcGlobal->xids) + PG_CACHE_LINE_SIZE);
+	ptr += TotalProcs * sizeof(*ProcGlobal->xids) + PG_CACHE_LINE_SIZE;
+
+	ProcGlobal->subxidStates = (XidCacheStatus *) ptr;
 	MemSet(ProcGlobal->subxidStates, 0, TotalProcs * sizeof(*ProcGlobal->subxidStates));
-	ProcGlobal->statusFlags = (uint8 *) ((char *) ProcGlobal->subxidStates + TotalProcs * sizeof(*ProcGlobal->subxidStates) + PG_CACHE_LINE_SIZE);
+	ptr += TotalProcs * sizeof(*ProcGlobal->subxidStates) + PG_CACHE_LINE_SIZE;
+
+	ProcGlobal->statusFlags = (uint8 *) ptr;
 	MemSet(ProcGlobal->statusFlags, 0, TotalProcs * sizeof(*ProcGlobal->statusFlags));
+	ptr += TotalProcs * sizeof(*ProcGlobal->statusFlags) + PG_CACHE_LINE_SIZE;
+
+	/* make sure wer didn't overflow */
+	Assert((ptr > (char *) procs) && (ptr <= (char *) procs + requestSize));
 
 	/*
 	 * Allocate arrays for fast-path locks. Those are variable-length, so
@@ -238,7 +277,9 @@ InitProcGlobal(void)
 	fpLockBitsSize = MAXALIGN(FastPathLockGroupsPerBackend * sizeof(uint64));
 	fpRelIdSize = MAXALIGN(FastPathLockSlotsPerBackend() * sizeof(Oid));
 
-	fpPtr = ShmemInitStruct("Fast path lock arrays", TotalProcs * (fpLockBitsSize + fpRelIdSize), &found);
+	fpPtr = ShmemInitStruct("Fast path lock arrays",
+							TotalProcs * (fpLockBitsSize + fpRelIdSize),
+							&found);
 	MemSet(fpPtr, 0, TotalProcs * (fpLockBitsSize + fpRelIdSize));
 
 	/* For asserts checking we did not overflow. */
@@ -335,26 +376,12 @@ InitProcGlobal(void)
 	PreparedXactProcs = &procs[MaxBackends + NUM_AUXILIARY_PROCS];
 
 	/* Create ProcStructLock spinlock, too */
-	ProcStructLock = (slock_t *) ShmemInitStruct("ProcStructLock spinlock", sizeof(slock_t), &found);
+	ProcStructLock = (slock_t *) ShmemInitStruct("ProcStructLock spinlock",
+												 sizeof(slock_t),
+												 &found);
 	SpinLockInit(ProcStructLock);
 }
 
-static Size
-PGProcShmemSize(void)
-{
-	Size		size;
-	uint32		TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + max_prepared_xacts;
-
-	size = TotalProcs * sizeof(PGPROC);
-	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->xids));
-	size = add_size(size, PG_CACHE_LINE_SIZE);
-	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->subxidStates));
-	size = add_size(size, PG_CACHE_LINE_SIZE);
-	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->statusFlags));
-	size = add_size(size, PG_CACHE_LINE_SIZE);
-	return size;
-}
-
 /*
  * InitProcess -- initialize a per-process PGPROC entry for this backend
  */
-- 
2.49.0

From 123e403ed6bfca929b3a33afe210eacc38dd73ad Mon Sep 17 00:00:00 2001
From: Tomas Vondra <to...@vondra.me>
Date: Sat, 22 Mar 2025 15:22:08 +0100
Subject: [PATCH v20250324 5/6] remove cacheline

---
 src/backend/storage/lmgr/proc.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 4afd7cd42c3..f7957eb008b 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -136,11 +136,8 @@ PGProcShmemSize(void)
 
 	size = TotalProcs * sizeof(PGPROC);
 	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->xids));
-	size = add_size(size, PG_CACHE_LINE_SIZE);
 	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->subxidStates));
-	size = add_size(size, PG_CACHE_LINE_SIZE);
 	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->statusFlags));
-	size = add_size(size, PG_CACHE_LINE_SIZE);
 	return size;
 }
 
@@ -256,15 +253,15 @@ InitProcGlobal(void)
 	 */
 	ProcGlobal->xids = (TransactionId *) ptr;
 	MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
-	ptr += TotalProcs * sizeof(*ProcGlobal->xids) + PG_CACHE_LINE_SIZE;
+	ptr += TotalProcs * sizeof(*ProcGlobal->xids);
 
 	ProcGlobal->subxidStates = (XidCacheStatus *) ptr;
 	MemSet(ProcGlobal->subxidStates, 0, TotalProcs * sizeof(*ProcGlobal->subxidStates));
-	ptr += TotalProcs * sizeof(*ProcGlobal->subxidStates) + PG_CACHE_LINE_SIZE;
+	ptr += TotalProcs * sizeof(*ProcGlobal->subxidStates);
 
 	ProcGlobal->statusFlags = (uint8 *) ptr;
 	MemSet(ProcGlobal->statusFlags, 0, TotalProcs * sizeof(*ProcGlobal->statusFlags));
-	ptr += TotalProcs * sizeof(*ProcGlobal->statusFlags) + PG_CACHE_LINE_SIZE;
+	ptr += TotalProcs * sizeof(*ProcGlobal->statusFlags);
 
 	/* make sure wer didn't overflow */
 	Assert((ptr > (char *) procs) && (ptr <= (char *) procs + requestSize));
-- 
2.49.0

From 8f54aefb9e3e83973d0a8adf6dc84fe876a0f858 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <to...@vondra.me>
Date: Sat, 22 Mar 2025 15:23:43 +0100
Subject: [PATCH v20250324 6/6] add cacheline padding back

---
 src/backend/storage/lmgr/proc.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index f7957eb008b..e9c22f03f27 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -136,8 +136,11 @@ PGProcShmemSize(void)
 
 	size = TotalProcs * sizeof(PGPROC);
 	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->xids));
+	size = add_size(size, PG_CACHE_LINE_SIZE);
 	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	size = add_size(size, PG_CACHE_LINE_SIZE);
 	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->statusFlags));
+	size = add_size(size, PG_CACHE_LINE_SIZE);
 	return size;
 }
 
@@ -242,26 +245,22 @@ InitProcGlobal(void)
 	 * Allocate arrays mirroring PGPROC fields in a dense manner. See
 	 * PROC_HDR.
 	 *
-	 * XXX: It might make sense to increase padding for these arrays, given
-	 * how hotly they are accessed.
-	 *
-	 * review: does the padding comment still make sense with PG_CACHE_LINE_SIZE?
-	 *         presumably that's the padding mentioned by the comment?
+	 * review: shouldn't the first cacheline padding be right after "procs", before "xids"?
 	 *
 	 * review: those lines are too long / not comprehensible, let's define some
 	 *         macros to calculate stuff?
 	 */
 	ProcGlobal->xids = (TransactionId *) ptr;
 	MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
-	ptr += TotalProcs * sizeof(*ProcGlobal->xids);
+	ptr += TotalProcs * sizeof(*ProcGlobal->xids) + PG_CACHE_LINE_SIZE;
 
 	ProcGlobal->subxidStates = (XidCacheStatus *) ptr;
 	MemSet(ProcGlobal->subxidStates, 0, TotalProcs * sizeof(*ProcGlobal->subxidStates));
-	ptr += TotalProcs * sizeof(*ProcGlobal->subxidStates);
+	ptr += TotalProcs * sizeof(*ProcGlobal->subxidStates) + PG_CACHE_LINE_SIZE;
 
 	ProcGlobal->statusFlags = (uint8 *) ptr;
 	MemSet(ProcGlobal->statusFlags, 0, TotalProcs * sizeof(*ProcGlobal->statusFlags));
-	ptr += TotalProcs * sizeof(*ProcGlobal->statusFlags);
+	ptr += TotalProcs * sizeof(*ProcGlobal->statusFlags) + PG_CACHE_LINE_SIZE;
 
 	/* make sure wer didn't overflow */
 	Assert((ptr > (char *) procs) && (ptr <= (char *) procs + requestSize));
-- 
2.49.0

Re: Improve monitoring of shared memory allocations

Reply via email to