On 3/27/25 13:56, Tomas Vondra wrote:
> ...
>
> OK, I don't have any other comments for 0001 and 0002. I'll do some
> more review and polishing on those, and will get them committed soon.
>
Actually ... while polishing 0001 and 0002, I noticed a couple more details that I'd like to ask about. I also ran pgindent and tweaked the formatting, so some of the diff is caused by that.

1) alignment

There was a comment with a question whether we need to MAXALIGN the chunks in dynahash.c, which were originally allocated by ShmemAlloc, but are now part of one large allocation that is then cut into pieces (using pointer arithmetic). I was not sure whether we need to enforce some alignment; we briefly discussed that off-list.

I realize you chose to add the alignment, but I haven't noticed any comment in the patch explaining why it's needed, and it seems to me it may not be quite correct. Let me explain what I had in mind, and why I think v5 doesn't actually do that.

It took me a while before I understood what alignment is about, and for a while it was haunting my patches, so hopefully this will help others ...

The "alignment" is about pointers (or addresses): a pointer is aligned when the address is a multiple of some number. For example, a 4B-aligned pointer is a multiple of 4B, so 0x00000100 is 4B-aligned, while 0x00000101 is not. Sometimes we use data types to express the alignment, e.g. int-aligned means 4B-aligned, but that's a detail. AFAIK the alignment is always 2^k, so 1, 2, 4, 8, ...

The primary reason for alignment is that some architectures require pointers to be well-aligned for a given data type. For example, an (int *) needs to be int-aligned; a pointer that's not 4B-aligned will trigger SIGBUS or maybe SIGSEGV. This was true for architectures like powerpc. I don't think x86/arm64 have this restriction, i.e. a misaligned access would work, even if there might be a minor performance impact. Anyway, we still enforce/expect correct alignment, because we may still support some of those alignment-sensitive platforms, and it's "tidy".

The other reason is that we sometimes use alignment to add padding, to reduce contention when accessing elements in hot arrays. We want to align to cacheline boundaries, so that a struct does not require accessing more cachelines than really necessary. And also to reduce contention - the more cachelines, the higher the risk of contention.

Now, back to the patch. The code originally did this to allocate the hctl:

    hashp = ShmemInitStruct(...)

and then, in element_alloc():

    firstElement = (HASHELEMENT *) ShmemAlloc(nelem * elementSize);

But this means the "elements" allocation is aligned to PG_CACHE_LINE_SIZE, i.e. 128B, because ShmemAllocRaw() does this:

    size = CACHELINEALIGN(size);

So it distributes memory in multiples of 128B, and I believe it starts at a multiple of 128B.

But the patch reworks this to allocate everything at once, and thus it won't get this alignment automatically. AFAIK that's not intentional, because no one explicitly mentioned this. And it may not be quite desirable, judging by the comment in ShmemAllocRaw(). I mentioned v5 adds alignment, but I don't think it does so correctly.
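[Editorial note: to make the terminology concrete before looking at the macros, here is a small standalone C sketch of the rounding the alignment macros perform. The definitions below are simplified copies of the ones in src/include/c.h, with MAXIMUM_ALIGNOF assumed to be 8 and PG_CACHE_LINE_SIZE assumed to be 128; the real values are platform- and build-dependent.]

    #include <stdio.h>
    #include <stdint.h>

    /* simplified copies of the macros in src/include/c.h */
    #define TYPEALIGN(ALIGNVAL, LEN) \
        (((uintptr_t) (LEN) + ((ALIGNVAL) - 1)) & ~((uintptr_t) ((ALIGNVAL) - 1)))
    #define MAXALIGN(LEN)       TYPEALIGN(8, (LEN))    /* assumes MAXIMUM_ALIGNOF = 8 */
    #define CACHELINEALIGN(LEN) TYPEALIGN(128, (LEN))  /* assumes PG_CACHE_LINE_SIZE = 128 */

    int
    main(void)
    {
        uintptr_t   p = 0x00000101;

        /* round a size (or address) up to the next multiple of the alignment */
        printf("MAXALIGN(13) = %zu\n", (size_t) MAXALIGN(13));              /* 16 */
        printf("CACHELINEALIGN(13) = %zu\n", (size_t) CACHELINEALIGN(13));  /* 128 */

        /* an address is 4B-aligned iff it is a multiple of 4 */
        printf("0x%lx %s 4B-aligned\n", (unsigned long) p,
               (p % 4 == 0) ? "is" : "is not");
        return 0;
    }

The key property is that the result is always a multiple of the alignment value; that is why rounding an offset up (rather than padding by a fixed amount) is what actually guarantees the next chunk starts on a boundary.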
It adds alignment by changing the macros from:

+#define HASH_ELEMENTS_OFFSET(hctl, nsegs) \
+	(sizeof(HASHHDR) + \
+	 ((hctl)->dsize * sizeof(HASHSEGMENT)) + \
+	 ((hctl)->ssize * (nsegs) * sizeof(HASHBUCKET)))

to

+#define HASH_ELEMENTS_OFFSET(hctl, nsegs) \
+	(MAXALIGN(sizeof(HASHHDR)) + \
+	 ((hctl)->dsize * MAXALIGN(sizeof(HASHSEGMENT))) + \
+	 ((hctl)->ssize * (nsegs) * MAXALIGN(sizeof(HASHBUCKET))))

First, it uses MAXALIGN, but that's mostly my fault, because my comment suggested that - the comment in ShmemAllocRaw however makes the case for using CACHELINEALIGN.

But more importantly, it adds alignment to the hctl field and to every element of those arrays. That's not what the alignment was supposed to do - it was supposed to align the arrays, not the individual elements. Not only would this waste memory, it would actually break direct access to those array elements.

So something like this might be more correct:

+#define HASH_ELEMENTS_OFFSET(hctl, nsegs) \
+	(MAXALIGN(sizeof(HASHHDR)) + \
+	 MAXALIGN((hctl)->dsize * sizeof(HASHSEGMENT)) + \
+	 MAXALIGN((hctl)->ssize * (nsegs) * sizeof(HASHBUCKET)))

But there's another detail - even before this patch, most of the stuff was allocated at once by ShmemInitStruct(), everything except the elements. So to replicate the alignment we only need to worry about that last part, and I think this should do:

+#define HASH_ELEMENTS_OFFSET(hctl, nsegs) \
+	CACHELINEALIGN(sizeof(HASHHDR) + \
+				   ((hctl)->dsize * sizeof(HASHSEGMENT)) + \
+				   ((hctl)->ssize * (nsegs) * sizeof(HASHBUCKET)))

This is what the 0003 patch does. There's still one minor difference, in that we used to align each batch of elements independently - each element_alloc() call allocated a new CACHELINEALIGN-ed chunk, while now we have just a single chunk. But I think that's OK.

In fact, I'm starting to wonder if we need to worry about this at all. Maybe it's not that significant for dynahash after all - otherwise we actually should align/pad the individual elements, not just the array as a whole, I think.

2) I do think the last patch should use CACHELINEALIGN() too, instead of adding PG_CACHE_LINE_SIZE to the sizes (which does not ensure good alignment, it just means there's a gap between the allocated parts). But I still don't know what the intention was, so it's hard to say ...

3) I find the comment before hash_get_init_size a bit unclear/confusing. It says this:

 * init_size should match the total number of elements allocated during
 * hash table creation, it could be zero for non-shared hash tables
 * depending on the value of nelem_alloc. For more explanation see
 * comments within this function.
 *
 * nelem_alloc parameter is not relevant for shared hash tables.

What does "should match" mean here? Doesn't it *determine* the number of elements allocated? What if it doesn't match? AFAICS it means the hash table is sized to expect init_size elements, but only nelem_alloc elements are actually pre-allocated, right? But the comment says it's init_size which determines the number of elements allocated during creation. Confusing.

It says "it could be zero ... depending on the value of nelem_alloc". Depending how? What's the relationship?

Then it says "nelem_alloc parameter is not relevant for shared hash tables". What does "not relevant" mean? It should say it's ignored.

The bit "For more explanation see comments within this function" is not great, if only because there are not many comments within the function, so there's no "more explanation". If there's something important, it should preferably be in the main comment.
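[Editorial note: returning to the array-vs-element distinction in point 1, the sketch below puts numbers on it. The 2-byte element is hypothetical, chosen so the difference is visible; HASHBUCKET itself is pointer-sized, so for it the two formulas happen to produce the same total, but the broken indexing applies either way.]

    #include <stdio.h>
    #include <stdint.h>

    #define MAXALIGN(LEN) \
        (((uintptr_t) (LEN) + 7) & ~((uintptr_t) 7))    /* assumes MAXIMUM_ALIGNOF = 8 */

    int
    main(void)
    {
        size_t      nelems = 100;
        size_t      elemsize = 2;   /* hypothetical element smaller than MAXIMUM_ALIGNOF */

        /*
         * MAXALIGN-ing every element changes the stride: array[i] assumes a
         * stride of exactly sizeof(element), so indexing would read the wrong
         * bytes -- and small elements waste most of each padded slot.
         */
        size_t      per_element = nelems * MAXALIGN(elemsize);  /* 800 bytes */

        /* MAXALIGN-ing the array as a whole pads once, at the end */
        size_t      whole_array = MAXALIGN(nelems * elemsize);  /* 200 bytes */

        printf("per-element: %zu, whole-array: %zu\n", per_element, whole_array);
        return 0;
    }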
regards

-- 
Tomas Vondra
From a0ab96326687ff8eb9ff3fd8343cff545226fd0a Mon Sep 17 00:00:00 2001
From: Rahila Syed <rahilasyed...@gmail.com>
Date: Thu, 27 Mar 2025 12:59:02 +0530
Subject: [PATCH v6 1/7] Account for all the shared memory allocated by
 hash_create

pg_shmem_allocations tracks the memory allocated by ShmemInitStruct,
which, in case of shared hash tables, only covers memory allocated
to the hash directory and header structure. The hash segments and
buckets are allocated using ShmemAllocNoError which does not attribute
the allocations to the hash table name. Thus, these allocations are
not tracked in pg_shmem_allocations.

Allocate memory for segments, buckets and elements together with the
directory and header structures. This results in the existing
ShmemIndex entries reflecting the size of the hash table more
accurately, thus improving the pg_shmem_allocations monitoring. Also,
make this change for non-shared hash tables, since they share the
hash_create code.
---
 src/backend/storage/ipc/shmem.c   |   3 +-
 src/backend/utils/hash/dynahash.c | 265 +++++++++++++++++++++++-------
 src/include/utils/hsearch.h       |   3 +-
 3 files changed, 213 insertions(+), 58 deletions(-)

diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 895a43fb39e..d8aed0bfaaf 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -73,6 +73,7 @@
 #include "storage/shmem.h"
 #include "storage/spin.h"
 #include "utils/builtins.h"
+#include "utils/dynahash.h"
 
 static void *ShmemAllocRaw(Size size, Size *allocated_size);
 
@@ -346,7 +347,7 @@ ShmemInitHash(const char *name, /* table string name for shmem index */
 
 	/* look it up in the shmem index */
 	location = ShmemInitStruct(name,
-							   hash_get_shared_size(infoP, hash_flags),
+							   hash_get_init_size(infoP, hash_flags, init_size, 0),
 							   &found);
 
 	/*
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 3f25929f2d8..1f215a16c51 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -260,12 +260,36 @@ static long hash_accesses,
 			hash_expansions;
 #endif
 
+
+#define HASH_ELEMENTS_OFFSET(hctl, nsegs) \
+	(MAXALIGN(sizeof(HASHHDR)) + \
+	 ((hctl)->dsize * MAXALIGN(sizeof(HASHSEGMENT))) + \
+	 ((hctl)->ssize * (nsegs) * MAXALIGN(sizeof(HASHBUCKET))))
+
+#define HASH_ELEMENTS(hashp, nsegs) \
+	((char *) (hashp)->hctl + HASH_ELEMENTS_OFFSET((hashp)->hctl, nsegs))
+
+#define HASH_SEGMENT_OFFSET(hctl, idx) \
+	(MAXALIGN(sizeof(HASHHDR)) + \
+	 ((hctl)->dsize * MAXALIGN(sizeof(HASHSEGMENT))) + \
+	 ((hctl)->ssize * (idx) * MAXALIGN(sizeof(HASHBUCKET))))
+
+#define HASH_SEGMENT_PTR(hashp, idx) \
+	(HASHSEGMENT) ((char *) (hashp)->hctl + HASH_SEGMENT_OFFSET((hashp)->hctl, (idx)))
+
+#define HASH_SEGMENT_SIZE(hashp) ((hashp)->ssize * MAXALIGN(sizeof(HASHBUCKET)))
+
+#define HASH_DIRECTORY(hashp) (HASHSEGMENT *) (((char *) (hashp)->hctl) + MAXALIGN(sizeof(HASHHDR)))
+
+#define HASH_ELEMENT_NEXT(hctl, num, ptr) \
+	((char *) (ptr) + ((num) * (MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN((hctl)->entrysize))))
+
 /*
  * Private function prototypes
  */
 static void *DynaHashAlloc(Size size);
 static HASHSEGMENT seg_alloc(HTAB *hashp);
-static bool element_alloc(HTAB *hashp, int nelem, int freelist_idx);
+static HASHELEMENT *element_alloc(HTAB *hashp, int nelem);
 static bool dir_realloc(HTAB *hashp);
 static bool expand_table(HTAB *hashp);
 static HASHBUCKET get_hash_entry(HTAB *hashp, int freelist_idx);
@@ -281,6 +305,11 @@ static void register_seq_scan(HTAB *hashp);
 static void deregister_seq_scan(HTAB *hashp);
 static bool has_seq_scans(HTAB *hashp);
+static void compute_buckets_and_segs(long nelem, long num_partitions,
+									 long ssize,	/* segment size */
+									 int *nbuckets, int *nsegments);
+static void element_add(HTAB *hashp, HASHELEMENT *firstElement,
+						int nelem, int freelist_idx);
 
 /*
  * memory allocation support
@@ -353,6 +382,7 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 {
 	HTAB	   *hashp;
 	HASHHDR    *hctl;
+	int			nelem_batch;
 
 	/*
 	 * Hash tables now allocate space for key and data, but you have to say
@@ -507,9 +537,19 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		hashp->isshared = false;
 	}
 
+	/* Choose the number of entries to allocate at a time. */
+	nelem_batch = choose_nelem_alloc(info->entrysize);
+
+	/*
+	 * Allocate the memory needed for hash header, directory, segments and
+	 * elements together. Use pointer arithmetic to arrive at the start of
+	 * each of these structures later.
+	 */
 	if (!hashp->hctl)
 	{
-		hashp->hctl = (HASHHDR *) hashp->alloc(sizeof(HASHHDR));
+		Size size = hash_get_init_size(info, flags, nelem, nelem_batch);
+
+		hashp->hctl = (HASHHDR *) hashp->alloc(size);
 		if (!hashp->hctl)
 			ereport(ERROR,
 					(errcode(ERRCODE_OUT_OF_MEMORY),
@@ -558,6 +598,9 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 	hctl->keysize = info->keysize;
 	hctl->entrysize = info->entrysize;
 
+	/* remember how many elements to allocate at once */
+	hctl->nelem_alloc = nelem_batch;
+
 	/* make local copies of heavily-used constant fields */
 	hashp->keysize = hctl->keysize;
 	hashp->ssize = hctl->ssize;
@@ -582,6 +625,9 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 					freelist_partitions,
 					nelem_alloc,
 					nelem_alloc_first;
+		void	   *ptr = NULL;
+		int			nsegs;
+		int			nbuckets;
 
 		/*
 		 * If hash table is partitioned, give each freelist an equal share of
@@ -592,6 +638,16 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		else
 			freelist_partitions = 1;
 
+		compute_buckets_and_segs(nelem, hctl->num_partitions, hctl->ssize,
+								 &nbuckets, &nsegs);
+
+		/*
+		 * Calculate the offset at which to find the first partition of
+		 * elements.
+		 * We have to skip space for the header, segments and buckets.
+		 */
+		ptr = HASH_ELEMENTS(hashp, nsegs);
+
 		nelem_alloc = nelem / freelist_partitions;
 		if (nelem_alloc <= 0)
 			nelem_alloc = 1;
@@ -610,10 +666,17 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		{
 			int			temp = (i == 0) ? nelem_alloc_first : nelem_alloc;
 
-			if (!element_alloc(hashp, temp, i))
-				ereport(ERROR,
-						(errcode(ERRCODE_OUT_OF_MEMORY),
-						 errmsg("out of memory")));
+			/*
+			 * Assign the correct location of each parition within a
+			 * pre-allocated buffer.
+			 *
+			 * Actual memory allocation happens in ShmemInitHash for
+			 * shared hash tables or earlier in this function for non-shared
+			 * hash tables.
+			 * We just need to split that allocation per-batch freelists.
+			 */
+			element_add(hashp, (HASHELEMENT *) ptr, temp, i);
+			ptr = HASH_ELEMENT_NEXT(hctl, temp, ptr);
 		}
 	}
 
@@ -701,30 +764,12 @@ init_htab(HTAB *hashp, long nelem)
 		for (i = 0; i < NUM_FREELISTS; i++)
 			SpinLockInit(&(hctl->freeList[i].mutex));
 
-	/*
-	 * Allocate space for the next greater power of two number of buckets,
-	 * assuming a desired maximum load factor of 1.
-	 */
-	nbuckets = next_pow2_int(nelem);
-
-	/*
-	 * In a partitioned table, nbuckets must be at least equal to
-	 * num_partitions; were it less, keys with apparently different partition
-	 * numbers would map to the same bucket, breaking partition independence.
-	 * (Normally nbuckets will be much bigger; this is just a safety check.)
-	 */
-	while (nbuckets < hctl->num_partitions)
-		nbuckets <<= 1;
+	compute_buckets_and_segs(nelem, hctl->num_partitions, hctl->ssize,
+							 &nbuckets, &nsegs);
 
 	hctl->max_bucket = hctl->low_mask = nbuckets - 1;
 	hctl->high_mask = (nbuckets << 1) - 1;
 
-	/*
-	 * Figure number of directory segments needed, round up to a power of 2
-	 */
-	nsegs = (nbuckets - 1) / hctl->ssize + 1;
-	nsegs = next_pow2_int(nsegs);
-
 	/*
 	 * Make sure directory is big enough. If pre-allocated directory is too
 	 * small, choke (caller screwed up).
@@ -737,26 +782,25 @@ init_htab(HTAB *hashp, long nelem)
 		return false;
 	}
 
-	/* Allocate a directory */
+	/*
+	 * Assign a directory by making it point to the correct location in the
+	 * pre-allocated buffer.
+	 */
 	if (!(hashp->dir))
 	{
 		CurrentDynaHashCxt = hashp->hcxt;
-		hashp->dir = (HASHSEGMENT *)
-			hashp->alloc(hctl->dsize * sizeof(HASHSEGMENT));
-		if (!hashp->dir)
-			return false;
+		hashp->dir = HASH_DIRECTORY(hashp);
 	}
 
-	/* Allocate initial segments */
+	/* Assign initial segments, which are also pre-allocated */
+	i = 0;
 	for (segp = hashp->dir; hctl->nsegs < nsegs; hctl->nsegs++, segp++)
 	{
-		*segp = seg_alloc(hashp);
-		if (*segp == NULL)
-			return false;
+		*segp = HASH_SEGMENT_PTR(hashp, i++);
+		MemSet(*segp, 0, HASH_SEGMENT_SIZE(hashp));
 	}
 
-	/* Choose number of entries to allocate at a time */
-	hctl->nelem_alloc = choose_nelem_alloc(hctl->entrysize);
+	Assert(i == nsegs);
 
 #ifdef HASH_DEBUG
 	fprintf(stderr, "init_htab:\n%s%p\n%s%ld\n%s%ld\n%s%d\n%s%ld\n%s%u\n%s%x\n%s%x\n%s%ld\n",
@@ -847,15 +891,79 @@ hash_select_dirsize(long num_entries)
 
 /*
  * Compute the required initial memory allocation for a shared-memory
- * hashtable with the given parameters. We need space for the HASHHDR
- * and for the (non expansible) directory.
+ * or non-shared memory hashtable with the given parameters.
+ * We need space for the HASHHDR, for the directory, segments and
+ * the init_size elements in buckets.
+ *
+ * For shared hash tables the directory size is non-expansive.
+ *
+ * init_size should match the total number of elements allocated
+ * during hash table creation, it could be zero for non-shared hash
+ * tables depending on the value of nelem_alloc. For more explanation
+ * see comments within this function.
+ *
+ * nelem_alloc parameter is not relevant for shared hash tables.
 */
 Size
-hash_get_shared_size(HASHCTL *info, int flags)
+hash_get_init_size(const HASHCTL *info, int flags, long init_size, int nelem_alloc)
 {
-	Assert(flags & HASH_DIRSIZE);
-	Assert(info->dsize == info->max_dsize);
-	return sizeof(HASHHDR) + info->dsize * sizeof(HASHSEGMENT);
+	int			nbuckets;
+	int			nsegs;
+	int			num_partitions;
+	long		ssize;
+	long		dsize;
+	bool		element_alloc = true; /*Always true for shared hash tables */
+	Size		elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(info->entrysize);
+
+	/*
+	 * For non-shared hash tables, the requested number of elements are
+	 * allocated only if they are less than nelem_alloc. In any case, the
+	 * init_size should be equal to the number of elements added using
+	 * element_add() in hash_create.
+	 */
+	if (!(flags & HASH_SHARED_MEM))
+	{
+		if (init_size > nelem_alloc)
+			element_alloc = false;
+	}
+	else
+	{
+		Assert(flags & HASH_DIRSIZE);
+		Assert(info->dsize == info->max_dsize);
+	}
+	/* Non-shared hash tables may not specify dir size */
+	if (!(flags & HASH_DIRSIZE))
+	{
+		dsize = DEF_DIRSIZE;
+	}
+	else
+		dsize = info->dsize;
+
+	if (flags & HASH_PARTITION)
+	{
+		num_partitions = info->num_partitions;
+
+		/* Number of entries should be atleast equal to the freelists */
+		if (init_size < NUM_FREELISTS)
+			init_size = NUM_FREELISTS;
+	}
+	else
+		num_partitions = 0;
+
+	if (flags & HASH_SEGMENT)
+		ssize = info->ssize;
+	else
+		ssize = DEF_SEGSIZE;
+
+	compute_buckets_and_segs(init_size, num_partitions, ssize,
+							 &nbuckets, &nsegs);
+
+	if (!element_alloc)
+		init_size = 0;
+
+	return MAXALIGN(sizeof(HASHHDR)) + dsize * MAXALIGN(sizeof(HASHSEGMENT)) +
+		+ MAXALIGN(sizeof(HASHBUCKET)) * ssize * nsegs
+		+ init_size * elementSize;
 }
 
@@ -1285,7 +1393,8 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 			 * Failing because the needed element is in a different freelist is
 			 * not acceptable.
 			 */
-			if (!element_alloc(hashp, hctl->nelem_alloc, freelist_idx))
+			newElement = element_alloc(hashp, hctl->nelem_alloc);
+			if (newElement == NULL)
 			{
 				int			borrow_from_idx;
 
@@ -1322,6 +1431,7 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 				/* no elements available to borrow either, so out of memory */
 				return NULL;
 			}
+			element_add(hashp, newElement, hctl->nelem_alloc, freelist_idx);
 		}
 
 		/* remove entry from freelist, bump nentries */
@@ -1700,30 +1810,43 @@ seg_alloc(HTAB *hashp)
 }
 
 /*
- * allocate some new elements and link them into the indicated free list
+ * allocate some new elements
 */
-static bool
-element_alloc(HTAB *hashp, int nelem, int freelist_idx)
+static HASHELEMENT *
+element_alloc(HTAB *hashp, int nelem)
 {
 	HASHHDR    *hctl = hashp->hctl;
 	Size		elementSize;
-	HASHELEMENT *firstElement;
-	HASHELEMENT *tmpElement;
-	HASHELEMENT *prevElement;
-	int			i;
+	HASHELEMENT *firstElement = NULL;
 
 	if (hashp->isfixed)
-		return false;
+		return NULL;
 
 	/* Each element has a HASHELEMENT header plus user data. */
 	elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
-
 	CurrentDynaHashCxt = hashp->hcxt;
 	firstElement = (HASHELEMENT *) hashp->alloc(nelem * elementSize);
 
 	if (!firstElement)
-		return false;
+		return NULL;
+
+	return firstElement;
+}
+
+/*
+ * Link the elements allocated by element_alloc into the indicated free list
+ */
+static void
+element_add(HTAB *hashp, HASHELEMENT *firstElement, int nelem, int freelist_idx)
+{
+	HASHHDR    *hctl = hashp->hctl;
+	Size		elementSize;
+	HASHELEMENT *tmpElement;
+	HASHELEMENT *prevElement;
+	int			i;
 
+	/* Each element has a HASHELEMENT header plus user data. */
+	elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
 	/* prepare to link all the new entries into the freelist */
 	prevElement = NULL;
 	tmpElement = firstElement;
@@ -1744,8 +1867,6 @@ element_alloc(HTAB *hashp, int nelem, int freelist_idx)
 
 	if (IS_PARTITIONED(hctl))
 		SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
-
-	return true;
 }
 
 /*
@@ -1957,3 +2078,35 @@ AtEOSubXact_HashTables(bool isCommit, int nestDepth)
 		}
 	}
 }
+
+/*
+ * Calculate the number of buckets and segments to store the given
+ * number of elements in a hash table. Segments contain buckets which
+ * in turn contain elements.
+ */
+static void
+compute_buckets_and_segs(long nelem, long num_partitions, long ssize,
+						 int *nbuckets, int *nsegments)
+{
+	/*
+	 * Allocate space for the next greater power of two number of buckets,
+	 * assuming a desired maximum load factor of 1.
+	 */
+	*nbuckets = next_pow2_int(nelem);
+
+	/*
+	 * In a partitioned table, nbuckets must be at least equal to
+	 * num_partitions; were it less, keys with apparently different partition
+	 * numbers would map to the same bucket, breaking partition independence.
+	 * (Normally nbuckets will be much bigger; this is just a safety check.)
+	 */
+	while ((*nbuckets) < num_partitions)
+		(*nbuckets) <<= 1;
+
+
+	/*
+	 * Figure number of directory segments needed, round up to a power of 2
+	 */
+	*nsegments = ((*nbuckets) - 1) / ssize + 1;
+	*nsegments = next_pow2_int(*nsegments);
+}
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index 932cc4f34d9..79b959ffc3c 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -151,7 +151,8 @@ extern void hash_seq_term(HASH_SEQ_STATUS *status);
 extern void hash_freeze(HTAB *hashp);
 extern Size hash_estimate_size(long num_entries, Size entrysize);
 extern long hash_select_dirsize(long num_entries);
-extern Size hash_get_shared_size(HASHCTL *info, int flags);
+extern Size hash_get_init_size(const HASHCTL *info, int flags,
+							   long init_size, int nelem_alloc);
 extern void AtEOXact_HashTables(bool isCommit);
 extern void AtEOSubXact_HashTables(bool isCommit, int nestDepth);
 
-- 
2.49.0
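[Editorial note: for readers following patch 0001, the standalone sketch below walks the offset arithmetic behind its HASH_SEGMENT_OFFSET / HASH_ELEMENTS_OFFSET macros - a single allocation laid out as header | directory | segments (buckets) | elements. The concrete sizes are invented for illustration, and the per-field MAXALIGN calls whose correctness is debated in the review above are omitted; HASHSEGMENT and HASHBUCKET are both pointer types in dynahash.c.]

    #include <stdio.h>
    #include <stddef.h>

    int
    main(void)
    {
        /* illustrative stand-ins for the HASHHDR fields used by the macros */
        size_t      hdrsize = 512;  /* stand-in for sizeof(HASHHDR) */
        size_t      dsize = 256;    /* directory entries */
        size_t      ssize = 256;    /* buckets per segment */
        size_t      nsegs = 4;      /* initial segments */
        size_t      ptrsize = sizeof(void *);   /* HASHSEGMENT / HASHBUCKET */

        /* layout: header | directory | segments (buckets) | elements */
        size_t      dir_off = hdrsize;
        size_t      seg_off = dir_off + dsize * ptrsize;    /* segment 0 */
        size_t      elem_off = seg_off + ssize * nsegs * ptrsize;   /* elements */

        printf("directory at %zu, segments at %zu, elements at %zu\n",
               dir_off, seg_off, elem_off);
        return 0;
    }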
From 5d504b93f4880c7d1836304b7ad1a212bd5a5c58 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <to...@vondra.me>
Date: Thu, 27 Mar 2025 20:52:31 +0100
Subject: [PATCH v6 2/7] review

---
 src/backend/storage/ipc/shmem.c   |   3 +-
 src/backend/utils/hash/dynahash.c | 127 ++++++++++++++++--------------
 2 files changed, 70 insertions(+), 60 deletions(-)

diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index d8aed0bfaaf..7e03a371d22 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -347,7 +347,8 @@ ShmemInitHash(const char *name, /* table string name for shmem index */
 
 	/* look it up in the shmem index */
 	location = ShmemInitStruct(name,
-							   hash_get_init_size(infoP, hash_flags, init_size, 0),
+							   hash_get_init_size(infoP, hash_flags,
												  init_size, 0),
 							   &found);
 
 	/*
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 1f215a16c51..dcdc143e690 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -260,29 +260,35 @@ static long hash_accesses,
 			hash_expansions;
 #endif
 
-
-#define HASH_ELEMENTS_OFFSET(hctl, nsegs) \
-	(MAXALIGN(sizeof(HASHHDR)) + \
-	 ((hctl)->dsize * MAXALIGN(sizeof(HASHSEGMENT))) + \
-	 ((hctl)->ssize * (nsegs) * MAXALIGN(sizeof(HASHBUCKET))))
-
-#define HASH_ELEMENTS(hashp, nsegs) \
-	((char *) (hashp)->hctl + HASH_ELEMENTS_OFFSET((hashp)->hctl, nsegs))
+#define HASH_DIRECTORY_PTR(hashp) \
+	(((char *) (hashp)->hctl) + sizeof(HASHHDR))
 
 #define HASH_SEGMENT_OFFSET(hctl, idx) \
-	(MAXALIGN(sizeof(HASHHDR)) + \
-	 ((hctl)->dsize * MAXALIGN(sizeof(HASHSEGMENT))) + \
-	 ((hctl)->ssize * (idx) * MAXALIGN(sizeof(HASHBUCKET))))
+	(sizeof(HASHHDR) + \
+	 ((hctl)->dsize * sizeof(HASHSEGMENT)) + \
+	 ((hctl)->ssize * (idx) * sizeof(HASHBUCKET)))
 
 #define HASH_SEGMENT_PTR(hashp, idx) \
-	(HASHSEGMENT) ((char *) (hashp)->hctl + HASH_SEGMENT_OFFSET((hashp)->hctl, (idx)))
+	((char *) (hashp)->hctl + HASH_SEGMENT_OFFSET((hashp)->hctl, (idx)))
 
-#define HASH_SEGMENT_SIZE(hashp) ((hashp)->ssize * MAXALIGN(sizeof(HASHBUCKET)))
+#define HASH_SEGMENT_SIZE(hashp) \
+	((hashp)->ssize * sizeof(HASHBUCKET))
+
+/* XXX this is exactly the same as HASH_SEGMENT_OFFSET, unite? */
+#define HASH_ELEMENTS_OFFSET(hctl, nsegs) \
+	(sizeof(HASHHDR) + \
+	 ((hctl)->dsize * sizeof(HASHSEGMENT)) + \
+	 ((hctl)->ssize * (nsegs) * sizeof(HASHBUCKET)))
+
+#define HASH_ELEMENTS_PTR(hashp, nsegs) \
+	((char *) (hashp)->hctl + HASH_ELEMENTS_OFFSET((hashp)->hctl, nsegs))
 
-#define HASH_DIRECTORY(hashp) (HASHSEGMENT *) (((char *) (hashp)->hctl) + MAXALIGN(sizeof(HASHHDR)))
+/* Each element has a HASHELEMENT header plus user data. */
+#define HASH_ELEMENT_SIZE(hctl) \
+	(MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN((hctl)->entrysize))
 
 #define HASH_ELEMENT_NEXT(hctl, num, ptr) \
-	((char *) (ptr) + ((num) * (MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN((hctl)->entrysize))))
+	((char *) (ptr) + ((num) * HASH_ELEMENT_SIZE(hctl)))
 
 /*
  * Private function prototypes
@@ -305,8 +311,8 @@ static void register_seq_scan(HTAB *hashp);
 static void deregister_seq_scan(HTAB *hashp);
 static bool has_seq_scans(HTAB *hashp);
-static void compute_buckets_and_segs(long nelem, long num_partitions,
-									 long ssize,	/* segment size */
+static void compute_buckets_and_segs(long nelem, long num_partitions,
+									 long ssize,
 									 int *nbuckets, int *nsegments);
 static void element_add(HTAB *hashp, HASHELEMENT *firstElement,
 						int nelem, int freelist_idx);
@@ -547,7 +553,7 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 	 */
 	if (!hashp->hctl)
 	{
-		Size size = hash_get_init_size(info, flags, nelem, nelem_batch);
+		Size		size = hash_get_init_size(info, flags, nelem, nelem_batch);
 
 		hashp->hctl = (HASHHDR *) hashp->alloc(size);
 		if (!hashp->hctl)
@@ -638,16 +644,6 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		else
 			freelist_partitions = 1;
 
-		compute_buckets_and_segs(nelem, hctl->num_partitions, hctl->ssize,
-								 &nbuckets, &nsegs);
-
-		/*
-		 * Calculate the offset at which to find the first partition of
-		 * elements.
-		 * We have to skip space for the header, segments and buckets.
-		 */
-		ptr = HASH_ELEMENTS(hashp, nsegs);
-
 		nelem_alloc = nelem / freelist_partitions;
 		if (nelem_alloc <= 0)
 			nelem_alloc = 1;
@@ -662,19 +658,29 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		else
 			nelem_alloc_first = nelem_alloc;
 
+		/*
+		 * Calculate the offset at which to find the first partition of
+		 * elements. We have to skip space for the header, segments and
+		 * buckets.
+		 */
+		compute_buckets_and_segs(nelem, hctl->num_partitions, hctl->ssize,
+								 &nbuckets, &nsegs);
+
+		ptr = HASH_ELEMENTS_PTR(hashp, nsegs);
+
+		/*
+		 * Assign the correct location of each parition within a pre-allocated
+		 * buffer.
+		 *
+		 * Actual memory allocation happens in ShmemInitHash for shared hash
+		 * tables or earlier in this function for non-shared hash tables.
+		 *
+		 * We just need to split that allocation into per-batch freelists.
+		 */
 		for (i = 0; i < freelist_partitions; i++)
 		{
 			int			temp = (i == 0) ? nelem_alloc_first : nelem_alloc;
 
-			/*
-			 * Assign the correct location of each parition within a
-			 * pre-allocated buffer.
-			 *
-			 * Actual memory allocation happens in ShmemInitHash for
-			 * shared hash tables or earlier in this function for non-shared
-			 * hash tables.
-			 * We just need to split that allocation per-batch freelists.
-			 */
 			element_add(hashp, (HASHELEMENT *) ptr, temp, i);
 			ptr = HASH_ELEMENT_NEXT(hctl, temp, ptr);
 		}
@@ -789,14 +795,14 @@ init_htab(HTAB *hashp, long nelem)
 	if (!(hashp->dir))
 	{
 		CurrentDynaHashCxt = hashp->hcxt;
-		hashp->dir = HASH_DIRECTORY(hashp);
+		hashp->dir = (HASHSEGMENT *) HASH_DIRECTORY_PTR(hashp);
 	}
 
 	/* Assign initial segments, which are also pre-allocated */
 	i = 0;
 	for (segp = hashp->dir; hctl->nsegs < nsegs; hctl->nsegs++, segp++)
 	{
-		*segp = HASH_SEGMENT_PTR(hashp, i++);
+		*segp = (HASHSEGMENT) HASH_SEGMENT_PTR(hashp, i++);
 		MemSet(*segp, 0, HASH_SEGMENT_SIZE(hashp));
 	}
 
@@ -890,19 +896,22 @@ hash_select_dirsize(long num_entries)
 }
 
 /*
- * Compute the required initial memory allocation for a shared-memory
- * or non-shared memory hashtable with the given parameters.
- * We need space for the HASHHDR, for the directory, segments and
- * the init_size elements in buckets.
- *
+ * Compute the required initial memory allocation for a hashtable with the given
+ * parameters. The hash table may be shared or private. We need space for the
+ * HASHHDR, for the directory, segments and the init_size elements in buckets.
+ *
 * For shared hash tables the directory size is non-expansive.
- *
- * init_size should match the total number of elements allocated
- * during hash table creation, it could be zero for non-shared hash
- * tables depending on the value of nelem_alloc. For more explanation
- * see comments within this function.
+ *
+ * init_size should match the total number of elements allocated during hash
+ * table creation, it could be zero for non-shared hash tables depending on the
+ * value of nelem_alloc. For more explanation see comments within this function.
 *
 * nelem_alloc parameter is not relevant for shared hash tables.
+ *
+ * XXX I find this comment hard to understand / confusing. So what is init_size?
+ * Why should it match the total number of elements allocated during hash table
+ * creation? What does "it could be zero" say? Should it be zero, or why would I
+ * want to make it zero in some cases?
 */
 Size
 hash_get_init_size(const HASHCTL *info, int flags, long init_size, int nelem_alloc)
@@ -912,8 +921,8 @@ hash_get_init_size(const HASHCTL *info, int flags, long init_size, int nelem_all
 	int			nbuckets;
 	int			nsegs;
 	int			num_partitions;
 	long		ssize;
 	long		dsize;
-	bool		element_alloc = true; /*Always true for shared hash tables */
-	Size		elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(info->entrysize);
+	bool		element_alloc = true;	/* Always true for shared hash tables */
+	Size		elementSize = HASH_ELEMENT_SIZE(info);
 
 	/*
 	 * For non-shared hash tables, the requested number of elements are
@@ -931,6 +940,7 @@ hash_get_init_size(const HASHCTL *info, int flags, long init_size, int nelem_all
 		Assert(flags & HASH_DIRSIZE);
 		Assert(info->dsize == info->max_dsize);
 	}
+
 	/* Non-shared hash tables may not specify dir size */
 	if (!(flags & HASH_DIRSIZE))
 	{
@@ -961,8 +971,8 @@ hash_get_init_size(const HASHCTL *info, int flags, long init_size, int nelem_all
 	if (!element_alloc)
 		init_size = 0;
 
-	return MAXALIGN(sizeof(HASHHDR)) + dsize * MAXALIGN(sizeof(HASHSEGMENT)) +
-		+ MAXALIGN(sizeof(HASHBUCKET)) * ssize * nsegs
+	return sizeof(HASHHDR) + dsize * sizeof(HASHSEGMENT)
+		+ sizeof(HASHBUCKET) * ssize * nsegs
 		+ init_size * elementSize;
 }
 
@@ -1393,8 +1403,7 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 			 * Failing because the needed element is in a different freelist is
 			 * not acceptable.
 			 */
-			newElement = element_alloc(hashp, hctl->nelem_alloc);
-			if (newElement == NULL)
+			if ((newElement = element_alloc(hashp, hctl->nelem_alloc)) == NULL)
 			{
 				int			borrow_from_idx;
 
@@ -1823,7 +1832,7 @@ element_alloc(HTAB *hashp, int nelem)
 		return NULL;
 
 	/* Each element has a HASHELEMENT header plus user data. */
-	elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
+	elementSize = HASH_ELEMENT_SIZE(hctl);
 	CurrentDynaHashCxt = hashp->hcxt;
 	firstElement = (HASHELEMENT *) hashp->alloc(nelem * elementSize);
 
@@ -1834,7 +1843,7 @@ element_alloc(HTAB *hashp, int nelem)
 }
 
 /*
- * Link the elements allocated by element_alloc into the indicated free list
+ * link the elements allocated by element_alloc into the indicated free list
 */
 static void
 element_add(HTAB *hashp, HASHELEMENT *firstElement, int nelem, int freelist_idx)
@@ -1846,7 +1855,8 @@ element_add(HTAB *hashp, HASHELEMENT *firstElement, int nelem, int freelist_idx)
 	int			i;
 
 	/* Each element has a HASHELEMENT header plus user data. */
-	elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
+	elementSize = HASH_ELEMENT_SIZE(hctl);
+
 	/* prepare to link all the new entries into the freelist */
 	prevElement = NULL;
 	tmpElement = firstElement;
@@ -2103,7 +2113,6 @@ compute_buckets_and_segs(long nelem, long num_partitions, long ssize,
 	while ((*nbuckets) < num_partitions)
 		(*nbuckets) <<= 1;
 
-
 	/*
 	 * Figure number of directory segments needed, round up to a power of 2
 	 */
-- 
2.49.0
From 88974dc9a36759eaf2b0048b6ac45bc0c83cb537 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <to...@vondra.me>
Date: Thu, 27 Mar 2025 21:58:22 +0100
Subject: [PATCH v6 3/7] align using CACHELINESIZE, to match ShmemAllocRaw

---
 src/backend/utils/hash/dynahash.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index dcdc143e690..fc5575daaea 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -276,7 +276,7 @@ static long hash_accesses,
 
 /* XXX this is exactly the same as HASH_SEGMENT_OFFSET, unite? */
 #define HASH_ELEMENTS_OFFSET(hctl, nsegs) \
-	(sizeof(HASHHDR) + \
+	CACHELINEALIGN(sizeof(HASHHDR) + \
 	 ((hctl)->dsize * sizeof(HASHSEGMENT)) + \
 	 ((hctl)->ssize * (nsegs) * sizeof(HASHBUCKET)))
 
@@ -971,9 +971,11 @@ hash_get_init_size(const HASHCTL *info, int flags, long init_size, int nelem_all
 	if (!element_alloc)
 		init_size = 0;
 
-	return sizeof(HASHHDR) + dsize * sizeof(HASHSEGMENT)
-		+ sizeof(HASHBUCKET) * ssize * nsegs
-		+ init_size * elementSize;
+	/* enforce elements to be cacheline aligned */
+	return CACHELINEALIGN(sizeof(HASHHDR)
+						  + (dsize * sizeof(HASHSEGMENT))
+						  + (ssize * nsegs * sizeof(HASHBUCKET)))
+		+ (init_size * elementSize);
 }
 
-- 
2.49.0
From edccce9feb8a930150d97d034ef96c6bff544d96 Mon Sep 17 00:00:00 2001
From: Rahila Syed <rahilasyed...@gmail.com>
Date: Thu, 27 Mar 2025 16:43:28 +0530
Subject: [PATCH v6 4/7] Replace ShmemAlloc calls by ShmemInitStruct

The shared memory allocated by ShmemAlloc is not tracked by
pg_shmem_allocations. This commit replaces most of the calls to
ShmemAlloc by ShmemInitStruct, to associate a name with the allocations
and ensure that they get tracked by pg_shmem_allocations. It also
merges several smaller ShmemAlloc calls into larger ShmemInitStruct
calls, to allocate and track all the related memory allocations under
a single call.
---
 src/backend/storage/lmgr/predicate.c | 27 +++++++++-----
 src/backend/storage/lmgr/proc.c      | 56 +++++++++++++++++++++++-----
 2 files changed, 63 insertions(+), 20 deletions(-)

diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 5b21a053981..de2629fdf0c 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -1226,14 +1226,20 @@ PredicateLockShmemInit(void)
 	 */
 	max_table_size *= 10;
 
+	requestSize = add_size(PredXactListDataSize,
+						   (mul_size((Size) max_table_size,
+									 sizeof(SERIALIZABLEXACT))));
 	PredXact = ShmemInitStruct("PredXactList",
-							   PredXactListDataSize,
+							   requestSize,
 							   &found);
 	Assert(found == IsUnderPostmaster);
 	if (!found)
 	{
 		int			i;
 
+		/* reset everything, both the header and the element */
+		memset(PredXact, 0, requestSize);
+
 		dlist_init(&PredXact->availableList);
 		dlist_init(&PredXact->activeList);
 		PredXact->SxactGlobalXmin = InvalidTransactionId;
@@ -1242,11 +1248,8 @@ PredicateLockShmemInit(void)
 		PredXact->LastSxactCommitSeqNo = FirstNormalSerCommitSeqNo - 1;
 		PredXact->CanPartialClearThrough = 0;
 		PredXact->HavePartialClearedThrough = 0;
-		requestSize = mul_size((Size) max_table_size,
-							   sizeof(SERIALIZABLEXACT));
-		PredXact->element = ShmemAlloc(requestSize);
+		PredXact->element = (SERIALIZABLEXACT *) ((char *) PredXact + PredXactListDataSize);
 		/* Add all elements to available list, clean. */
-		memset(PredXact->element, 0, requestSize);
 		for (i = 0; i < max_table_size; i++)
 		{
 			LWLockInitialize(&PredXact->element[i].perXactPredicateListLock,
@@ -1299,21 +1302,25 @@ PredicateLockShmemInit(void)
 	 * probably OK.
 	 */
 	max_table_size *= 5;
+	requestSize = RWConflictPoolHeaderDataSize +
+		mul_size((Size) max_table_size,
+				 RWConflictDataSize);
 
 	RWConflictPool = ShmemInitStruct("RWConflictPool",
-									 RWConflictPoolHeaderDataSize,
+									 requestSize,
 									 &found);
 	Assert(found == IsUnderPostmaster);
 	if (!found)
 	{
 		int			i;
 
+		/* clean everything, including the elements */
+		memset(RWConflictPool, 0, requestSize);
+
 		dlist_init(&RWConflictPool->availableList);
-		requestSize = mul_size((Size) max_table_size,
-							   RWConflictDataSize);
-		RWConflictPool->element = ShmemAlloc(requestSize);
+		RWConflictPool->element = (RWConflict) ((char *) RWConflictPool +
+											RWConflictPoolHeaderDataSize);
 		/* Add all elements to available list, clean. */
-		memset(RWConflictPool->element, 0, requestSize);
 		for (i = 0; i < max_table_size; i++)
 		{
 			dlist_push_tail(&RWConflictPool->availableList,
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index e4ca861a8e6..6ee48410b84 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -123,6 +123,24 @@ ProcGlobalShmemSize(void)
 	return size;
 }
 
+/*
+ * review: add comment, explaining the PG_CACHE_LINE_SIZE thing
+ * review: I'd even maybe split the PG_CACHE_LINE_SIZE thing into
+ * a separate commit, not to mix it with the "monitoring improvement"
+ */
+static Size
+PGProcShmemSize(void)
+{
+	Size		size;
+	uint32		TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + max_prepared_xacts;
+
+	size = TotalProcs * sizeof(PGPROC);
+	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->xids));
+	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->statusFlags));
+	return size;
+}
+
 /*
  * Report number of semaphores needed by InitProcGlobal.
 */
@@ -175,6 +193,8 @@ InitProcGlobal(void)
 			   *fpEndPtr PG_USED_FOR_ASSERTS_ONLY;
 	Size		fpLockBitsSize,
 				fpRelIdSize;
+	Size		requestSize;
+	char	   *ptr;
 
 	/* Create the ProcGlobal shared structure */
 	ProcGlobal = (PROC_HDR *)
@@ -204,7 +224,15 @@ InitProcGlobal(void)
 	 * with a single freelist.) Each PGPROC structure is dedicated to exactly
 	 * one of these purposes, and they do not move between groups.
	 */
-	procs = (PGPROC *) ShmemAlloc(TotalProcs * sizeof(PGPROC));
+	requestSize = PGProcShmemSize();
+
+	ptr = ShmemInitStruct("PGPROC structures",
+						requestSize,
+						&found);
+
+	procs = (PGPROC *) ptr;
+	ptr = (char *)ptr + TotalProcs * sizeof(PGPROC);
+
 	MemSet(procs, 0, TotalProcs * sizeof(PGPROC));
 	ProcGlobal->allProcs = procs;
 	/* XXX allProcCount isn't really all of them; it excludes prepared xacts */
@@ -213,17 +241,21 @@ InitProcGlobal(void)
 	/*
 	 * Allocate arrays mirroring PGPROC fields in a dense manner. See
	 * PROC_HDR.
-	 *
-	 * XXX: It might make sense to increase padding for these arrays, given
-	 * how hotly they are accessed.
	 */
-	ProcGlobal->xids =
-		(TransactionId *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->xids));
+	ProcGlobal->xids = (TransactionId *) ptr;
 	MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
-	ProcGlobal->subxidStates = (XidCacheStatus *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	ptr = (char *)ptr + (TotalProcs * sizeof(*ProcGlobal->xids));
+
+	ProcGlobal->subxidStates = (XidCacheStatus *) ptr;
 	MemSet(ProcGlobal->subxidStates, 0, TotalProcs * sizeof(*ProcGlobal->subxidStates));
-	ProcGlobal->statusFlags = (uint8 *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->statusFlags));
+	ptr = (char *)ptr + (TotalProcs * sizeof(*ProcGlobal->subxidStates));
+
+	ProcGlobal->statusFlags = (uint8 *) ptr;
 	MemSet(ProcGlobal->statusFlags, 0, TotalProcs * sizeof(*ProcGlobal->statusFlags));
+	ptr = (char *)ptr + (TotalProcs * sizeof(*ProcGlobal->statusFlags));
+
+	/* make sure wer didn't overflow */
+	Assert((ptr > (char *) procs) && (ptr <= (char *) procs + requestSize));
 
 	/*
 	 * Allocate arrays for fast-path locks. Those are variable-length, so
@@ -233,7 +265,9 @@ InitProcGlobal(void)
 	fpLockBitsSize = MAXALIGN(FastPathLockGroupsPerBackend * sizeof(uint64));
 	fpRelIdSize = MAXALIGN(FastPathLockSlotsPerBackend() * sizeof(Oid));
 
-	fpPtr = ShmemAlloc(TotalProcs * (fpLockBitsSize + fpRelIdSize));
+	fpPtr = ShmemInitStruct("Fast path lock arrays",
+							TotalProcs * (fpLockBitsSize + fpRelIdSize),
+							&found);
 	MemSet(fpPtr, 0, TotalProcs * (fpLockBitsSize + fpRelIdSize));
 
 	/* For asserts checking we did not overflow. */
@@ -330,7 +364,9 @@ InitProcGlobal(void)
 	PreparedXactProcs = &procs[MaxBackends + NUM_AUXILIARY_PROCS];
 
 	/* Create ProcStructLock spinlock, too */
-	ProcStructLock = (slock_t *) ShmemAlloc(sizeof(slock_t));
+	ProcStructLock = (slock_t *) ShmemInitStruct("ProcStructLock spinlock",
+												 sizeof(slock_t),
+												 &found);
 	SpinLockInit(ProcStructLock);
 }
 
-- 
2.49.0
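[Editorial note: the pattern patch 0004 applies in InitProcGlobal - one ShmemInitStruct allocation carved into sub-arrays by advancing a char pointer - can be sketched standalone as below. Names and sizes are illustrative only, and malloc stands in for ShmemInitStruct.]

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int
    main(void)
    {
        size_t      n = 100;                    /* stand-in for TotalProcs */
        size_t      size_a = n * sizeof(int);   /* first array */
        size_t      size_b = n * sizeof(char);  /* second array */

        /* one big allocation (ShmemInitStruct in the patch) */
        char       *ptr = malloc(size_a + size_b);

        if (ptr == NULL)
            return 1;
        memset(ptr, 0, size_a + size_b);

        /* carve it into sub-arrays, advancing the pointer past each one */
        int        *a = (int *) ptr;    /* like ProcGlobal->xids */
        ptr += size_a;
        char       *b = ptr;            /* like ProcGlobal->statusFlags */
        ptr += size_b;

        /* ptr now marks the end, mirroring the overflow Assert in the patch */
        printf("a=%p b=%p end=%p\n", (void *) a, (void *) b, (void *) ptr);
        free(a);
        return 0;
    }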
From 03809b97dca1a25b1b61f119ec05ef4b1d8e5b49 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <to...@vondra.me>
Date: Thu, 27 Mar 2025 21:20:26 +0100
Subject: [PATCH v6 5/7] review

---
 src/backend/storage/lmgr/predicate.c | 13 ++++++++-----
 src/backend/storage/lmgr/proc.c      | 21 +++++++++++----------
 2 files changed, 19 insertions(+), 15 deletions(-)

diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index de2629fdf0c..d82114ffca1 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -1229,6 +1229,7 @@ PredicateLockShmemInit(void)
 	requestSize = add_size(PredXactListDataSize,
 						   (mul_size((Size) max_table_size,
 									 sizeof(SERIALIZABLEXACT))));
+
 	PredXact = ShmemInitStruct("PredXactList",
 							   requestSize,
 							   &found);
@@ -1237,7 +1238,7 @@ PredicateLockShmemInit(void)
 	{
 		int			i;
 
-		/* reset everything, both the header and the element */
+		/* clean everything, both the header and the element */
 		memset(PredXact, 0, requestSize);
 
 		dlist_init(&PredXact->availableList);
@@ -1248,7 +1249,8 @@ PredicateLockShmemInit(void)
 		PredXact->LastSxactCommitSeqNo = FirstNormalSerCommitSeqNo - 1;
 		PredXact->CanPartialClearThrough = 0;
 		PredXact->HavePartialClearedThrough = 0;
-		PredXact->element = (SERIALIZABLEXACT *) ((char *) PredXact + PredXactListDataSize);
+		PredXact->element
+			= (SERIALIZABLEXACT *) ((char *) PredXact + PredXactListDataSize);
 		/* Add all elements to available list, clean. */
 		for (i = 0; i < max_table_size; i++)
 		{
@@ -1302,9 +1304,10 @@ PredicateLockShmemInit(void)
 	 * probably OK.
 	 */
 	max_table_size *= 5;
+
 	requestSize = RWConflictPoolHeaderDataSize +
-		mul_size((Size) max_table_size,
-				 RWConflictDataSize);
+		mul_size((Size) max_table_size,
+				 RWConflictDataSize);
 
 	RWConflictPool = ShmemInitStruct("RWConflictPool",
 									 requestSize,
@@ -1319,7 +1322,7 @@ PredicateLockShmemInit(void)
 
 		dlist_init(&RWConflictPool->availableList);
 		RWConflictPool->element = (RWConflict) ((char *) RWConflictPool +
-											RWConflictPoolHeaderDataSize);
+												RWConflictPoolHeaderDataSize);
 		/* Add all elements to available list, clean. */
 		for (i = 0; i < max_table_size; i++)
 		{
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 6ee48410b84..c08e3bb3d56 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -124,20 +124,21 @@ ProcGlobalShmemSize(void)
 }
 
 /*
- * review: add comment, explaining the PG_CACHE_LINE_SIZE thing
- * review: I'd even maybe split the PG_CACHE_LINE_SIZE thing into
- * a separate commit, not to mix it with the "monitoring improvement"
+ * Report shared-memory space needed by PGPROC.
 */
 static Size
 PGProcShmemSize(void)
 {
 	Size		size;
-	uint32		TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + max_prepared_xacts;
+	uint32		TotalProcs;
+
+	TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + max_prepared_xacts;
 
 	size = TotalProcs * sizeof(PGPROC);
 	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->xids));
 	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->subxidStates));
 	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->statusFlags));
+
 	return size;
 }
 
@@ -227,11 +228,11 @@ InitProcGlobal(void)
 	requestSize = PGProcShmemSize();
 
 	ptr = ShmemInitStruct("PGPROC structures",
-						requestSize,
-						&found);
+						  requestSize,
+						  &found);
 
 	procs = (PGPROC *) ptr;
-	ptr = (char *)ptr + TotalProcs * sizeof(PGPROC);
+	ptr = (char *) ptr + TotalProcs * sizeof(PGPROC);
 
 	MemSet(procs, 0, TotalProcs * sizeof(PGPROC));
 	ProcGlobal->allProcs = procs;
@@ -244,15 +245,15 @@ InitProcGlobal(void)
 	 */
 	ProcGlobal->xids = (TransactionId *) ptr;
 	MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
-	ptr = (char *)ptr + (TotalProcs * sizeof(*ProcGlobal->xids));
+	ptr = (char *) ptr + (TotalProcs * sizeof(*ProcGlobal->xids));
 
 	ProcGlobal->subxidStates = (XidCacheStatus *) ptr;
 	MemSet(ProcGlobal->subxidStates, 0, TotalProcs * sizeof(*ProcGlobal->subxidStates));
-	ptr = (char *)ptr + (TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	ptr = (char *) ptr + (TotalProcs * sizeof(*ProcGlobal->subxidStates));
 
 	ProcGlobal->statusFlags = (uint8 *) ptr;
 	MemSet(ProcGlobal->statusFlags, 0, TotalProcs * sizeof(*ProcGlobal->statusFlags));
-	ptr = (char *)ptr + (TotalProcs * sizeof(*ProcGlobal->statusFlags));
+	ptr = (char *) ptr + (TotalProcs * sizeof(*ProcGlobal->statusFlags));
 
 	/* make sure wer didn't overflow */
 	Assert((ptr > (char *) procs) && (ptr <= (char *) procs + requestSize));
-- 
2.49.0
From 2a266c6115dca6ad32f7dbb136b63cf8d278504d Mon Sep 17 00:00:00 2001
From: Tomas Vondra <to...@vondra.me>
Date: Thu, 27 Mar 2025 21:23:05 +0100
Subject: [PATCH v6 6/7] Add cacheline padding between heavily accessed arrays

---
 src/backend/storage/lmgr/proc.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index c08e3bb3d56..6e0fc6a08e6 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -135,8 +135,11 @@ PGProcShmemSize(void)
 	TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + max_prepared_xacts;
 
 	size = TotalProcs * sizeof(PGPROC);
+	size = add_size(size, PG_CACHE_LINE_SIZE);
 	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->xids));
+	size = add_size(size, PG_CACHE_LINE_SIZE);
 	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	size = add_size(size, PG_CACHE_LINE_SIZE);
 	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->statusFlags));
 
 	return size;
@@ -232,7 +235,7 @@ InitProcGlobal(void)
 						  &found);
 
 	procs = (PGPROC *) ptr;
-	ptr = (char *) ptr + TotalProcs * sizeof(PGPROC);
+	ptr = (char *) ptr + TotalProcs * sizeof(PGPROC) + PG_CACHE_LINE_SIZE;
 
 	MemSet(procs, 0, TotalProcs * sizeof(PGPROC));
 	ProcGlobal->allProcs = procs;
@@ -245,11 +248,11 @@ InitProcGlobal(void)
 	 */
 	ProcGlobal->xids = (TransactionId *) ptr;
 	MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
-	ptr = (char *) ptr + (TotalProcs * sizeof(*ProcGlobal->xids));
+	ptr = (char *) ptr + TotalProcs * sizeof(*ProcGlobal->xids) + PG_CACHE_LINE_SIZE;
 
 	ProcGlobal->subxidStates = (XidCacheStatus *) ptr;
 	MemSet(ProcGlobal->subxidStates, 0, TotalProcs * sizeof(*ProcGlobal->subxidStates));
-	ptr = (char *) ptr + (TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	ptr = (char *) ptr + (TotalProcs * sizeof(*ProcGlobal->subxidStates)) + PG_CACHE_LINE_SIZE;
 
 	ProcGlobal->statusFlags = (uint8 *) ptr;
 	MemSet(ProcGlobal->statusFlags, 0, TotalProcs * sizeof(*ProcGlobal->statusFlags));
-- 
2.49.0
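[Editorial note: the difference between patch 0006 and the 0007 fix - point 2 of the review above - is easy to see in isolation. Adding PG_CACHE_LINE_SIZE to a running size only leaves a gap after each array, while CACHELINEALIGN rounds each offset up to a multiple of 128, so the next array actually starts on a cacheline boundary, assuming the base pointer itself is cacheline-aligned (which the earlier discussion of ShmemAllocRaw suggests it is). A minimal sketch:]

    #include <stdio.h>
    #include <stdint.h>

    #define PG_CACHE_LINE_SIZE  128
    #define CACHELINEALIGN(LEN) \
        (((uintptr_t) (LEN) + (PG_CACHE_LINE_SIZE - 1)) & \
         ~((uintptr_t) (PG_CACHE_LINE_SIZE - 1)))

    int
    main(void)
    {
        size_t      array1 = 1000;  /* illustrative size of the first array */

        /* patch 0006: next array starts at 1128, not a multiple of 128 */
        size_t      gap_only = array1 + PG_CACHE_LINE_SIZE;

        /* patch 0007: next array starts at 1024, a multiple of 128 */
        size_t      aligned = CACHELINEALIGN(array1);

        printf("gap-only offset: %zu (%% 128 = %zu)\n", gap_only, gap_only % 128);
        printf("aligned offset:  %zu (%% 128 = %zu)\n", aligned, aligned % 128);
        return 0;
    }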
From 8f74e27118cc68c5348def34cfda814ca846db49 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <to...@vondra.me>
Date: Thu, 27 Mar 2025 22:03:23 +0100
Subject: [PATCH v6 7/7] review

---
 src/backend/storage/lmgr/proc.c | 15 ++++++---------
 1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 6e0fc6a08e6..5757545defb 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -134,12 +134,9 @@ PGProcShmemSize(void)
 
 	TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + max_prepared_xacts;
 
-	size = TotalProcs * sizeof(PGPROC);
-	size = add_size(size, PG_CACHE_LINE_SIZE);
-	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->xids));
-	size = add_size(size, PG_CACHE_LINE_SIZE);
-	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->subxidStates));
-	size = add_size(size, PG_CACHE_LINE_SIZE);
+	size = CACHELINEALIGN(TotalProcs * sizeof(PGPROC));
+	size = add_size(size, CACHELINEALIGN(TotalProcs * sizeof(*ProcGlobal->xids)));
+	size = add_size(size, CACHELINEALIGN(TotalProcs * sizeof(*ProcGlobal->subxidStates)));
 	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->statusFlags));
 
 	return size;
@@ -235,7 +232,7 @@ InitProcGlobal(void)
 						  &found);
 
 	procs = (PGPROC *) ptr;
-	ptr = (char *) ptr + TotalProcs * sizeof(PGPROC) + PG_CACHE_LINE_SIZE;
+	ptr = (char *) ptr + CACHELINEALIGN(TotalProcs * sizeof(PGPROC));
 
 	MemSet(procs, 0, TotalProcs * sizeof(PGPROC));
 	ProcGlobal->allProcs = procs;
@@ -248,11 +245,11 @@ InitProcGlobal(void)
 	 */
 	ProcGlobal->xids = (TransactionId *) ptr;
 	MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
-	ptr = (char *) ptr + TotalProcs * sizeof(*ProcGlobal->xids) + PG_CACHE_LINE_SIZE;
+	ptr = (char *) ptr + CACHELINEALIGN(TotalProcs * sizeof(*ProcGlobal->xids));
 
 	ProcGlobal->subxidStates = (XidCacheStatus *) ptr;
 	MemSet(ProcGlobal->subxidStates, 0, TotalProcs * sizeof(*ProcGlobal->subxidStates));
-	ptr = (char *) ptr + (TotalProcs * sizeof(*ProcGlobal->subxidStates)) + PG_CACHE_LINE_SIZE;
+	ptr = (char *) ptr + CACHELINEALIGN(TotalProcs * sizeof(*ProcGlobal->subxidStates));
 
 	ProcGlobal->statusFlags = (uint8 *) ptr;
 	MemSet(ProcGlobal->statusFlags, 0, TotalProcs * sizeof(*ProcGlobal->statusFlags));
-- 
2.49.0