Adding basic NUMA awareness

Tomas Vondra Tue, 01 Jul 2025 12:07:33 -0700

Hi,

This is a WIP version of a patch series I'm working on, adding some
basic NUMA awareness for a couple parts of our shared memory (shared
buffers, etc.). It's based on Andres' experimental patches he spoke
about at pgconf.eu 2024 [1], and while it's improved and polished in
various ways, it's still experimental.

But there's a recent thread aiming to do something similar [2], so
better to share it now so that we can discuss both approaches. This
patch set is a bit more ambitious, handling NUMA in a way to allow
smarter optimizations later, so I'm posting it in a separate thread.

The series is split into patches addressing different parts of the
shared memory, starting (unsurprisingly) from shared buffers, then
buffer freelists and ProcArray. There's a couple additional parts, but
those are smaller / addressing miscellaneous stuff.

Each patch has a numa_ GUC, intended to enable/disable that part. This
is meant to make development easier, not as a final interface. I'm not
sure how exactly that should look. It's possible some combinations of
GUCs won't work, etc.

Each patch should have a commit message explaining the intent and
implementation, and then also detailed comments explaining various
challenges and open questions.

But let me go over the basics, and discuss some of the design choices
and open questions that need solving.

1) v1-0001-NUMA-interleaving-buffers.patch

This is the main thing when people think about NUMA - making sure the
shared buffers are allocated evenly on all the nodes, not just on a
single node (which can happen easily with warmup). The regular memory
interleaving would address this, but it also has some disadvantages.

Firstly, it's oblivious to the contents of the shared memory segment,
and we may not want to interleave everything. It's also oblivious to
alignment of the items (a buffer can easily end up "split" on multiple
NUMA nodes), or relationship between different parts (e.g. there's a
BufferBlock and a related BufferDescriptor, and those might again end up
on different nodes).

So the patch handles this by explicitly mapping chunks of shared buffers
to different nodes - a bit like interleaving, but in larger chunks.
Ideally each node gets (1/N) of shared buffers, as a contiguous chunk.

It's a bit more complicated, because the patch distributes both the
blocks and descriptors, in the same way. So a buffer and it's descriptor
always end on the same NUMA node. This is one of the reasons why we need
to map larger chunks, because NUMA works on page granularity, and the
descriptors are tiny - many fit on a memory page.

There's a secondary benefit of explicitly assigning buffers to nodes,
using this simple scheme - it allows quickly determining the node ID
given a buffer ID. This is helpful later, when building freelist.

The patch is fairly simple. Most of the complexity is about picking the
chunk size, and aligning the arrays (so that it nicely aligns with
memory pages).

The patch has a GUC "numa_buffers_interleave", with "off" by default.

2) v1-0002-NUMA-localalloc.patch

This simply sets "localalloc" when initializing a backend, so that all
memory allocated later is local, not interleaved. Initially this was
necessary because the patch set the allocation policy to interleaving
before initializing shared memory, and we didn't want to interleave the
private memory. But that's no longer the case - the explicit mapping to
nodes does not have this issue. I'm keeping the patch for convenience,
it allows experimenting with numactl etc.

The patch has a GUC "numa_localalloc", with "off" by default.

3) v1-0003-freelist-Don-t-track-tail-of-a-freelist.patch

Minor optimization. Andres noticed we're tracking the tail of buffer
freelist, without using it. So the patch removes that.

4) v1-0004-NUMA-partition-buffer-freelist.patch

Right now we have a single freelist, and in busy instances that can be
quite contended. What's worse, the freelist may trash between different
CPUs, NUMA nodes, etc. So the idea is to have multiple freelists on
subsets of buffers. The patch implements multiple strategies how the
list can be split (configured using "numa_partition_freelist" GUC), for
experimenting:

* node - One list per NUMA node. This is the most natural option,
because we now know which buffer is on which node, so we can ensure a
list for a node only has buffers from that list.

* cpu - One list per CPU. Pretty simple, each CPU gets it's own list.

* pid - Similar to "cpu", but the processes are mapped to lists based on
PID, not CPU ID.

* none - nothing, sigle freelist

Ultimately, I think we'll want to go with "node", simply because it
aligns with the buffer interleaving. But there are improvements needed.

The main challenge is that with multiple smaller lists, a process can't
really use the whole shared buffers. So a single backed will only use
part of the memory. The more lists there are, the worse this effect is.
This is also why I think we won't use the other partitioning options,
because there's going to be more CPUs than NUMA nodes.

Obviously, this needs solving even with NUMA nodes - we need to allow a
single backend to utilize the whole shared buffers if needed. There
should be a way to "steal" buffers from other freelists (if the
"regular" freelist is empty), but the patch does not implement this.
Shouldn't be hard, I think.

The other missing part is clocksweep - there's still just a single
instance of clocksweep, feeding buffers to all the freelists. But that's
clearly a problem, because the clocksweep returns buffers from all NUMA
nodes. The clocksweep really needs to be partitioned the same way as a
freelists, and each partition will operate on a subset of buffers (from
the right NUMA node).

I do have a separate experimental patch doing something like that, I
need to make it part of this branch.

5) v1-0005-NUMA-interleave-PGPROC-entries.patch

Another area that seems like it might benefit from NUMA is PGPROC, so I
gave it a try. It turned out somewhat challenging. Similarly to buffers
we have two pieces that need to be located in a coordinated way - PGPROC
entries and fast-path arrays. But we can't use the same approach as for
buffers/descriptors, because

(a) Neither of those pieces aligns with memory page size (PGPROC is
~900B, fast-path arrays are variable length).

(b) We could pad PGPROC entries e.g. to 1KB, but that'd still require
rather high max_connections before we use multiple huge pages.

The fast-path arrays are less of a problem, because those tend to be
larger, and are accessed through pointers, so we can just adjust that.

So what I did instead is splitting the whole PGPROC array into one array
per NUMA node, and one array for auxiliary processes and 2PC xacts. So
with 4 NUMA nodes there are 5 separate arrays, for example. Each array
is a multiple of memory pages, so we may waste some of the memory. But
that's simply how NUMA works - page granularity.

This however makes one particular thing harder - in a couple places we
accessed PGPROC entries through PROC_HDR->allProcs, which was pretty
much just one large array. And GetNumberFromPGProc() relied on array
arithmetics to determine procnumber. With the array partitioned, this
can't work the same way.

But there's a simple solution - if we turn allProcs into an array of
*pointers* to PGPROC arrays, there's no issue. All the places need a
pointer anyway. And then we need an explicit procnumber field in PGPROC,
instead of calculating it.

There's a chance this have negative impact on code that accessed PGPROC
very often, but so far I haven't seen such cases. But if you can come up
with such examples, I'd like to see those.

There's another detail - when obtaining a PGPROC entry in InitProcess(),
we try to get an entry from the same NUMA node. And only if that doesn't
work, we grab the first one from the list (there's still just one PGPROC
freelist, I haven't split that - maybe we should?).

This has a GUC "numa_procs_interleave", again "off" by default. It's not
quite correct, though, because the partitioning happens always. It only
affects the PGPROC lookup. (In a way, this may be a bit broken.)

6) v1-0006-NUMA-pin-backends-to-NUMA-nodes.patch

This is an experimental patch, that simply pins the new process to the
NUMA node obtained from the freelist.

Driven by GUC "numa_procs_pin" (default: off).

Summary
-------

So this is what I have at the moment. I've tried to organize the patches
in the order of importance, but that's just my guess. It's entirely
possible there's something I missed, some other order might make more
sense, etc.

There's also the question how this is related to other patches affecting
shared memory - I think the most relevant one is the "shared buffers
online resize" by Ashutosh, simply because it touches the shared memory.

I don't think the splitting would actually make some things simpler, or
maybe more flexible - in particular, it'd allow us to enable huge pages
only for some regions (like shared buffers), and keep the small pages
e.g. for PGPROC. So that'd be good.

But there'd also need to be some logic to "rework" how shared buffers
get mapped to NUMA nodes after resizing. It'd be silly to start with
memory on 4 nodes (25% each), resize shared buffers to 50% and end up
with memory only on 2 of the nodes (because the other 2 nodes were
originally assigned the upper half of shared buffers).

I don't have a clear idea how this would be done, but I guess it'd
require a bit of code invoked sometime after the resize. It'd already
need to rebuild the freelists in some way, I guess.

The other thing I haven't thought about very much is determining on
which CPUs/nodes the instance is allowed to run. I assume we'd start by
simply inherit/determine that at the start through libnuma, not through
some custom PG configuration (which the patch [2] proposed to do).

regards

[1] https://www.youtube.com/watch?v=V75KpACdl6E

[2]
https://www.postgresql.org/message-id/CAKZiRmw6i1W1AwXxa-Asrn8wrVcVH3TO715g_MCoowTS9rkGyw%40mail.gmail.com

--
Tomas Vondra

From 9712e50d6d15c18ea2c5fcf457972486b0d4ef53 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <to...@vondra.me>
Date: Tue, 6 May 2025 21:12:21 +0200
Subject: [PATCH v1 1/6] NUMA: interleaving buffers

Ensure shared buffers are allocated from all NUMA nodes, in a balanced
way, instead of just using the node where Postgres initially starts, or
where the kernel decides to migrate the page, etc. With pre-warming
performed by a single backend, this can easily result in severely
unbalanced memory distribution (with most from a single NUMA node).

The kernel would eventually move some of the memory to other nodes
(thanks to zone_reclaim), but that tends to take a long time. So this
patch improves predictability, reduces the time needed for warmup
during benchmarking, etc.  It's less dependent on what the CPU
scheduler does, etc.

Furthermore, the buffers are mapped to NUMA nodes in a deterministic
way, so this also allows further improvements like backends using
buffers from the same NUMA node.

The effect is similar to

     numactl --interleave=all

but there's a number of important differences.

Firstly, it's applied only to shared buffers (and also to descriptors),
not to the whole shared memory segment. It's not clear we'd want to use
interleaving for all parts, storing entries with different sizes and
life cycles (e.g. ProcArray may need different approach).

Secondly, it considers the page and block size, and makes sure not to
split a buffer on different NUMA nodes (which with the regular
interleaving is guaranteed to happen, unless when using huge pages). The
patch performs "explicit" interleaving, so that buffers are not split
like this.

The patch maps both buffers and buffer descriptors, so that the buffer
and it's buffer descriptor end up on the same NUMA node.

The mapping happens in larger chunks (see choose_chunk_items). This is
required to handle buffer descriptors (which are smaller than buffers),
and it should also help to reduce the number of mappings. Most NUMA
systems will use 1GB chunks, unless using very small shared buffers.

Notes:

* The feature is enabled by numa_buffers_interleave GUC (false by default)

* It's not clear we want to enable interleaving for all shared memory.
  We probably want that for shared buffers, but maybe not for ProcArray
  or freelists.

* Similar questions are about huge pages - in general it's a good idea,
  but maybe it's not quite good for ProcArray. It's somewhate separate
  from NUMA, but not entirely because NUMA works on page granularity.
  PGPROC entries are ~8KB, so too large for interleaving with 4K pages,
  as we don't want to split the entry to multiple nodes. But could be
  done explicitly, by specifying which node to use for the pages.

* We could partition ProcArray, with one partition per NUMA node, and
  then at connection time pick a node from the same node. The process
  could migrate to some other node later, especially for long-lived
  connections, but there's no perfect solution, Maybe we could set
  affinity to cores from the same node, or something like that?
---
 src/backend/storage/buffer/buf_init.c | 384 +++++++++++++++++++++++++-
 src/backend/storage/buffer/bufmgr.c   |   1 +
 src/backend/utils/init/globals.c      |   3 +
 src/backend/utils/misc/guc_tables.c   |  10 +
 src/bin/pgbench/pgbench.c             |  67 ++---
 src/include/miscadmin.h               |   2 +
 src/include/storage/bufmgr.h          |   1 +
 7 files changed, 427 insertions(+), 41 deletions(-)

diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index ed1dc488a42..2ad34624c49 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -14,9 +14,17 @@
  */
 #include "postgres.h"
 
+#ifdef USE_LIBNUMA
+#include <numa.h>
+#include <numaif.h>
+#endif
+
+#include "port/pg_numa.h"
 #include "storage/aio.h"
 #include "storage/buf_internals.h"
 #include "storage/bufmgr.h"
+#include "storage/pg_shmem.h"
+#include "storage/proc.h"
 
 BufferDescPadded *BufferDescriptors;
 char	   *BufferBlocks;
@@ -25,6 +33,19 @@ WritebackContext BackendWritebackContext;
 CkptSortItem *CkptBufferIds;
 
 
+static Size get_memory_page_size(void);
+static int64 choose_chunk_buffers(int NBuffers, Size mem_page_size, int num_nodes);
+static void pg_numa_interleave_memory(char *startptr, char *endptr,
+									  Size mem_page_size, Size chunk_size,
+									  int num_nodes);
+
+/* number of buffers allocated on the same NUMA node */
+static int64 numa_chunk_buffers = -1;
+
+/* number of NUMA nodes (as returned by numa_num_configured_nodes) */
+static int	numa_nodes = -1;
+
+
 /*
  * Data Structures:
  *		buffers live in a freelist and a lookup data structure.
@@ -71,18 +92,80 @@ BufferManagerShmemInit(void)
 				foundDescs,
 				foundIOCV,
 				foundBufCkpt;
+	Size		mem_page_size;
+	Size		buffer_align;
+
+	/*
+	 * XXX A bit weird. Do we need to worry about postmaster? Could this even
+	 * run outside postmaster? I don't think so.
+	 *
+	 * XXX Another issue is we may get different values than when sizing the
+	 * the memory, because at that point we didn't know if we get huge pages,
+	 * so we assumed we will. Shouldn't cause crashes, but we might allocate
+	 * shared memory and then not use some of it (because of the alignment
+	 * that we don't actually need). Not sure about better way, good for now.
+	 */
+	if (IsUnderPostmaster)
+		mem_page_size = pg_get_shmem_pagesize();
+	else
+		mem_page_size = get_memory_page_size();
+
+	/*
+	 * With NUMA we need to ensure the buffers are properly aligned not just
+	 * to PG_IO_ALIGN_SIZE, but also to memory page size, because NUMA works
+	 * on page granularity, and we don't want a buffer to get split to
+	 * multiple nodes (when using multiple memory pages).
+	 *
+	 * We also don't want to interfere with other parts of shared memory,
+	 * which could easily happen with huge pages (e.g. with data stored before
+	 * buffers).
+	 *
+	 * We do this by aligning to the larger of the two values (we know both
+	 * are power-of-two values, so the larger value is automatically a
+	 * multiple of the lesser one).
+	 *
+	 * XXX Maybe there's a way to use less alignment?
+	 *
+	 * XXX Maybe with (mem_page_size > PG_IO_ALIGN_SIZE), we don't need to
+	 * align to mem_page_size? Especially for very large huge pages (e.g. 1GB)
+	 * that doesn't seem quite worth it. Maybe we should simply align to
+	 * BLCKSZ, so that buffers don't get split? Still, we might interfere with
+	 * other stuff stored in shared memory that we want to allocate on a
+	 * particular NUMA node (e.g. ProcArray).
+	 *
+	 * XXX Maybe with "too large" huge pages we should just not do this, or
+	 * maybe do this only for sufficiently large areas (e.g. shared buffers,
+	 * but not ProcArray).
+	 */
+	buffer_align = Max(mem_page_size, PG_IO_ALIGN_SIZE);
+
+	/* one page is a multiple of the other */
+	Assert(((mem_page_size % PG_IO_ALIGN_SIZE) == 0) ||
+		   ((PG_IO_ALIGN_SIZE % mem_page_size) == 0));
 
-	/* Align descriptors to a cacheline boundary. */
+	/*
+	 * Align descriptors to a cacheline boundary, and memory page.
+	 *
+	 * We want to distribute both to NUMA nodes, so that each buffer and it's
+	 * descriptor are on the same NUMA node. So we align both the same way.
+	 *
+	 * XXX The memory page is always larger than cacheline, so the cacheline
+	 * reference is a bit unnecessary.
+	 *
+	 * XXX In principle we only need to do this with NUMA, otherwise we could
+	 * still align just to cacheline, as before.
+	 */
 	BufferDescriptors = (BufferDescPadded *)
-		ShmemInitStruct("Buffer Descriptors",
-						NBuffers * sizeof(BufferDescPadded),
-						&foundDescs);
+		TYPEALIGN(buffer_align,
+				  ShmemInitStruct("Buffer Descriptors",
+								  NBuffers * sizeof(BufferDescPadded) + buffer_align,
+								  &foundDescs));
 
 	/* Align buffer pool on IO page size boundary. */
 	BufferBlocks = (char *)
-		TYPEALIGN(PG_IO_ALIGN_SIZE,
+		TYPEALIGN(buffer_align,
 				  ShmemInitStruct("Buffer Blocks",
-								  NBuffers * (Size) BLCKSZ + PG_IO_ALIGN_SIZE,
+								  NBuffers * (Size) BLCKSZ + buffer_align,
 								  &foundBufs));
 
 	/* Align condition variables to cacheline boundary. */
@@ -112,6 +195,63 @@ BufferManagerShmemInit(void)
 	{
 		int			i;
 
+		/*
+		 * Assign chunks of buffers and buffer descriptors to the available
+		 * NUMA nodes. We can't use the regular interleaving, because with
+		 * regular memory pages (smaller than BLCKSZ) we'd split all buffers
+		 * to multiple NUMA nodes. And we don't want that.
+		 *
+		 * But even with huge pages it seems like a good idea to not have
+		 * mapping for each page.
+		 *
+		 * So we always assign a larger contiguous chunk of buffers to the
+		 * same NUMA node, as calculated by choose_chunk_buffers(). We try to
+		 * keep the chunks large enough to work both for buffers and buffer
+		 * descriptors, but not too large. See the comments at
+		 * choose_chunk_buffers() for details.
+		 *
+		 * Thanks to the earlier alignment (to memory page etc.), we know the
+		 * buffers won't get split, etc.
+		 *
+		 * This also makes it easier / straightforward to calculate which NUMA
+		 * node a buffer belongs to (it's a matter of divide + mod). See
+		 * BufferGetNode().
+		 */
+		if (numa_buffers_interleave)
+		{
+			char	   *startptr,
+					   *endptr;
+			Size		chunk_size;
+
+			numa_nodes = numa_num_configured_nodes();
+
+			numa_chunk_buffers
+				= choose_chunk_buffers(NBuffers, mem_page_size, numa_nodes);
+
+			elog(LOG, "BufferManagerShmemInit num_nodes %d chunk_buffers %ld",
+				 numa_nodes, numa_chunk_buffers);
+
+			/* first map buffers */
+			startptr = BufferBlocks;
+			endptr = startptr + ((Size) NBuffers) * BLCKSZ;
+			chunk_size = (numa_chunk_buffers * BLCKSZ);
+
+			pg_numa_interleave_memory(startptr, endptr,
+									  mem_page_size,
+									  chunk_size,
+									  numa_nodes);
+
+			/* now do the same for buffer descriptors */
+			startptr = (char *) BufferDescriptors;
+			endptr = startptr + ((Size) NBuffers) * sizeof(BufferDescPadded);
+			chunk_size = (numa_chunk_buffers * sizeof(BufferDescPadded));
+
+			pg_numa_interleave_memory(startptr, endptr,
+									  mem_page_size,
+									  chunk_size,
+									  numa_nodes);
+		}
+
 		/*
 		 * Initialize all the buffer headers.
 		 */
@@ -144,6 +284,11 @@ BufferManagerShmemInit(void)
 		GetBufferDescriptor(NBuffers - 1)->freeNext = FREENEXT_END_OF_LIST;
 	}
 
+	/*
+	 * As this point we have all the buffers in a single long freelist. With
+	 * freelist partitioning we rebuild them in StrategyInitialize.
+	 */
+
 	/* Init other shared buffer-management stuff */
 	StrategyInitialize(!foundDescs);
 
@@ -152,24 +297,72 @@ BufferManagerShmemInit(void)
 						 &backend_flush_after);
 }
 
+/*
+ * Determine the size of memory page.
+ *
+ * XXX This is a bit tricky, because the result depends at which point we call
+ * this. Before the allocation we don't know if we succeed in allocating huge
+ * pages - but we have to size everything for the chance that we will. And then
+ * if the huge pages fail (with 'huge_pages=try'), we'll use the regular memory
+ * pages. But at that point we can't adjust the sizing.
+ *
+ * XXX Maybe with huge_pages=try we should do the sizing twice - first with
+ * huge pages, and if that fails, then without them. But not for this patch.
+ * Up to this point there was no such dependency on huge pages.
+ */
+static Size
+get_memory_page_size(void)
+{
+	Size		os_page_size;
+	Size		huge_page_size;
+
+#ifdef WIN32
+	SYSTEM_INFO sysinfo;
+
+	GetSystemInfo(&sysinfo);
+	os_page_size = sysinfo.dwPageSize;
+#else
+	os_page_size = sysconf(_SC_PAGESIZE);
+#endif
+
+	/* assume huge pages get used, unless HUGE_PAGES_OFF */
+	if (huge_pages_status != HUGE_PAGES_OFF)
+		GetHugePageSize(&huge_page_size, NULL);
+	else
+		huge_page_size = 0;
+
+	return Max(os_page_size, huge_page_size);
+}
+
 /*
  * BufferManagerShmemSize
  *
  * compute the size of shared memory for the buffer pool including
  * data pages, buffer descriptors, hash tables, etc.
+ *
+ * XXX Called before allocation, so we don't know if huge pages get used yet.
+ * So we need to assume huge pages get used, and use get_memory_page_size()
+ * to calculate the largest possible memory page.
  */
 Size
 BufferManagerShmemSize(void)
 {
 	Size		size = 0;
+	Size		mem_page_size;
+
+	/* XXX why does IsUnderPostmaster matter? */
+	if (IsUnderPostmaster)
+		mem_page_size = pg_get_shmem_pagesize();
+	else
+		mem_page_size = get_memory_page_size();
 
 	/* size of buffer descriptors */
 	size = add_size(size, mul_size(NBuffers, sizeof(BufferDescPadded)));
 	/* to allow aligning buffer descriptors */
-	size = add_size(size, PG_CACHE_LINE_SIZE);
+	size = add_size(size, Max(mem_page_size, PG_IO_ALIGN_SIZE));
 
 	/* size of data pages, plus alignment padding */
-	size = add_size(size, PG_IO_ALIGN_SIZE);
+	size = add_size(size, Max(mem_page_size, PG_IO_ALIGN_SIZE));
 	size = add_size(size, mul_size(NBuffers, BLCKSZ));
 
 	/* size of stuff controlled by freelist.c */
@@ -186,3 +379,178 @@ BufferManagerShmemSize(void)
 
 	return size;
 }
+
+/*
+ * choose_chunk_buffers
+ *		choose the number of buffers allocated to a NUMA node at once
+ *
+ * We don't map shared buffers to NUMA nodes one by one, but in larger chunks.
+ * This is both for efficiency reasons (fewer mappings), and also because we
+ * want to map buffer descriptors too - and descriptors are much smaller. So
+ * we pick a number that's high enough for descriptors to use whole pages.
+ *
+ * We also want to keep buffers somehow evenly distributed on nodes, with
+ * about NBuffers/nodes per node. So we don't use chunks larger than this,
+ * to keep it as fair as possible (the chunk size is a possible difference
+ * between memory allocated to different NUMA nodes).
+ *
+ * It's possible shared buffers are so small this is not possible (i.e.
+ * it's less than chunk_size). But sensible NUMA systems will use a lot
+ * of memory, so this is unlikely.
+ *
+ * We simply print a warning about the misbalance, and that's it.
+ *
+ * XXX It'd be good to ensure the chunk size is a power-of-2, because then
+ * we could calculate the NUMA node simply by shift/modulo, while now we
+ * have to do a division. But we don't know how many buffers and buffer
+ * descriptors fits into a memory page. It may not be a power-of-2.
+ */
+static int64
+choose_chunk_buffers(int NBuffers, Size mem_page_size, int num_nodes)
+{
+	int64		num_items;
+	int64		max_items;
+
+	/* make sure the chunks will align nicely */
+	Assert(BLCKSZ % sizeof(BufferDescPadded) == 0);
+	Assert(mem_page_size % sizeof(BufferDescPadded) == 0);
+	Assert(((BLCKSZ % mem_page_size) == 0) || ((mem_page_size % BLCKSZ) == 0));
+
+	/*
+	 * The minimum number of items to fill a memory page with descriptors and
+	 * blocks. The NUMA allocates memory in pages, and we need to do that for
+	 * both buffers and descriptors.
+	 *
+	 * In practice the BLCKSZ doesn't really matter, because it's much larger
+	 * than BufferDescPadded, so the result is determined buffer descriptors.
+	 * But it's clearer this way.
+	 */
+	num_items = Max(mem_page_size / sizeof(BufferDescPadded),
+					mem_page_size / BLCKSZ);
+
+	/*
+	 * We shouldn't use chunks larger than NBuffers/num_nodes, because with
+	 * larger chunks the last NUMA node would end up with much less memory (or
+	 * no memory at all).
+	 */
+	max_items = (NBuffers / num_nodes);
+
+	/*
+	 * Did we already exceed the maximum desirable chunk size? That is, will
+	 * the last node get less than one whole chunk (or no memory at all)?
+	 */
+	if (num_items > max_items)
+		elog(WARNING, "choose_chunk_buffers: chunk items exceeds max (%ld > %ld)",
+			 num_items, max_items);
+
+	/* grow the chunk size until we hit the max limit. */
+	while (2 * num_items <= max_items)
+		num_items *= 2;
+
+	/*
+	 * XXX It's not difficult to construct cases where we end up with not
+	 * quite balanced distribution. For example, with shared_buffers=10GB and
+	 * 4 NUMA nodes, we end up with 2GB chunks, which means the first node
+	 * gets 4GB, and the three other nodes get 2GB each.
+	 *
+	 * We could be smarter, and try to get more balanced distribution. We
+	 * could simply reduce max_items e.g. to
+	 *
+	 * max_items = (NBuffers / num_nodes) / 4;
+	 *
+	 * in which cases we'd end up with 512MB chunks, and each nodes would get
+	 * the same 2.5GB chunk. It may not always work out this nicely, but it's
+	 * better than with (NBuffers / num_nodes).
+	 *
+	 * Alternatively, we could "backtrack" - try with the large max_items,
+	 * check how balanced it is, and if it's too imbalanced, try with a
+	 * smaller one.
+	 *
+	 * We however want a simple scheme.
+	 */
+
+	return num_items;
+}
+
+/*
+ * Calculate the NUMA node for a given buffer.
+ */
+int
+BufferGetNode(Buffer buffer)
+{
+	/* not NUMA interleaving */
+	if (numa_chunk_buffers == -1)
+		return -1;
+
+	return (buffer / numa_chunk_buffers) % numa_nodes;
+}
+
+/*
+ * pg_numa_interleave_memory
+ *		move memory to different NUMA nodes in larger chunks
+ *
+ * startptr - start of the region (should be aligned to page size)
+ * endptr - end of the region (doesn't need to be aligned)
+ * mem_page_size - size of the memory page size
+ * chunk_size - size of the chunk to move to a single node (should be multiple
+ *              of page size
+ * num_nodes - number of nodes to allocate memory to
+ *
+ * XXX Maybe this should use numa_tonode_memory and numa_police_memory instead?
+ * That might be more efficient than numa_move_pages, as it works on larger
+ * chunks of memory, not individual system pages, I think.
+ *
+ * XXX The "interleave" name is not quite accurate, I guess.
+ */
+static void
+pg_numa_interleave_memory(char *startptr, char *endptr,
+						  Size mem_page_size, Size chunk_size,
+						  int num_nodes)
+{
+	volatile uint64 touch pg_attribute_unused();
+	char	   *ptr = startptr;
+
+	/* chunk size has to be a multiple of memory page */
+	Assert((chunk_size % mem_page_size) == 0);
+
+	/*
+	 * Walk the memory pages in the range, and determine the node for each
+	 * one. We use numa_tonode_memory(), because then we can move a whole
+	 * memory range to the node, we don't need to worry about individual pages
+	 * like with numa_move_pages().
+	 */
+	while (ptr < endptr)
+	{
+		/* We may have an incomplete chunk at the end. */
+		Size		sz = Min(chunk_size, (endptr - ptr));
+
+		/*
+		 * What NUMA node does this range belong to? Each chunk should go to
+		 * the same NUMA node, in a round-robin manner.
+		 */
+		int			node = ((ptr - startptr) / chunk_size) % num_nodes;
+
+		/*
+		 * Nope, we have the first buffer from the next memory page, and we'll
+		 * set NUMA node for it (and all pages up to the next buffer). The
+		 * buffer should align with the memory page, thanks to the
+		 * buffer_align earlier.
+		 */
+		Assert((int64) ptr % mem_page_size == 0);
+		Assert((sz % mem_page_size) == 0);
+
+		/*
+		 * XXX no return value, to make this fail on error, has to use
+		 * numa_set_strict
+		 *
+		 * XXX Should we still touch the memory first, like with numa_move_pages,
+		 * or is that not necessary?
+		 */
+		numa_tonode_memory(ptr, sz, node);
+
+		ptr += sz;
+	}
+
+	/* should have processed all chunks */
+	Assert(ptr == endptr);
+}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 406ce77693c..e1e1cfd379d 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -685,6 +685,7 @@ ReadRecentBuffer(RelFileLocator rlocator, ForkNumber forkNum, BlockNumber blockN
 	BufferDesc *bufHdr;
 	BufferTag	tag;
 	uint32		buf_state;
+
 	Assert(BufferIsValid(recent_buffer));
 
 	ResourceOwnerEnlarge(CurrentResourceOwner);
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index d31cb45a058..876cb64cf66 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -145,6 +145,9 @@ int			max_worker_processes = 8;
 int			max_parallel_workers = 8;
 int			MaxBackends = 0;
 
+/* NUMA stuff */
+bool		numa_buffers_interleave = false;
+
 /* GUC parameters for vacuum */
 int			VacuumBufferUsageLimit = 2048;
 
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 511dc32d519..198a57e70a5 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2116,6 +2116,16 @@ struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"numa_buffers_interleave", PGC_POSTMASTER, DEVELOPER_OPTIONS,
+			gettext_noop("Enables NUMA interleaving of shared buffers."),
+			gettext_noop("When enabled, the buffers in shared memory are interleaved to all NUMA nodes."),
+		},
+		&numa_buffers_interleave,
+		false,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"sync_replication_slots", PGC_SIGHUP, REPLICATION_STANDBY,
 			gettext_noop("Enables a physical standby to synchronize logical failover replication slots from the primary server."),
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 69b6a877dc9..c07de903f76 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -305,7 +305,7 @@ static const char *progname;
 #define	CPU_PINNING_RANDOM		1
 #define	CPU_PINNING_COLOCATED	2
 
-static int pinning_mode = CPU_PINNING_NONE;
+static int	pinning_mode = CPU_PINNING_NONE;
 
 #define WSEP '@'				/* weight separator */
 
@@ -874,20 +874,20 @@ static bool socket_has_input(socket_set *sa, int fd, int idx);
  */
 typedef struct cpu_generator_state
 {
-	int		ncpus;		/* number of CPUs available */
-	int		nitems;		/* number of items in the queue */
-	int	   *nthreads;	/* number of threads for each CPU */
-	int	   *nclients;	/* number of processes for each CPU */
-	int	   *items;		/* queue of CPUs to pick from */
-} cpu_generator_state;
+	int			ncpus;			/* number of CPUs available */
+	int			nitems;			/* number of items in the queue */
+	int		   *nthreads;		/* number of threads for each CPU */
+	int		   *nclients;		/* number of processes for each CPU */
+	int		   *items;			/* queue of CPUs to pick from */
+}			cpu_generator_state;
 
 static cpu_generator_state cpu_generator_init(int ncpus);
-static void cpu_generator_refill(cpu_generator_state *state);
-static void cpu_generator_reset(cpu_generator_state *state);
-static int cpu_generator_thread(cpu_generator_state *state);
-static int cpu_generator_client(cpu_generator_state *state, int thread_cpu);
-static void cpu_generator_print(cpu_generator_state *state);
-static bool cpu_generator_check(cpu_generator_state *state);
+static void cpu_generator_refill(cpu_generator_state * state);
+static void cpu_generator_reset(cpu_generator_state * state);
+static int	cpu_generator_thread(cpu_generator_state * state);
+static int	cpu_generator_client(cpu_generator_state * state, int thread_cpu);
+static void cpu_generator_print(cpu_generator_state * state);
+static bool cpu_generator_check(cpu_generator_state * state);
 
 static void reset_pinning(TState *threads, int nthreads);
 
@@ -7422,7 +7422,7 @@ main(int argc, char **argv)
 	/* try to assign threads/clients to CPUs */
 	if (pinning_mode != CPU_PINNING_NONE)
 	{
-		int nprocs = get_nprocs();
+		int			nprocs = get_nprocs();
 		cpu_generator_state state = cpu_generator_init(nprocs);
 
 retry:
@@ -7433,6 +7433,7 @@ retry:
 		for (i = 0; i < nthreads; i++)
 		{
 			TState	   *thread = &threads[i];
+
 			thread->cpu = cpu_generator_thread(&state);
 		}
 
@@ -7444,7 +7445,7 @@ retry:
 		while (true)
 		{
 			/* did we find any unassigned backend? */
-			bool found = false;
+			bool		found = false;
 
 			for (i = 0; i < nthreads; i++)
 			{
@@ -7678,10 +7679,10 @@ threadRun(void *arg)
 		/* determine PID of the backend, pin it to the same CPU */
 		for (int i = 0; i < nstate; i++)
 		{
-			char   *pid_str;
-			pid_t	pid;
+			char	   *pid_str;
+			pid_t		pid;
 
-			PGresult *res = PQexec(state[i].con, "select pg_backend_pid()");
+			PGresult   *res = PQexec(state[i].con, "select pg_backend_pid()");
 
 			if (PQresultStatus(res) != PGRES_TUPLES_OK)
 				pg_fatal("could not determine PID of the backend for client %d",
@@ -8184,7 +8185,7 @@ cpu_generator_init(int ncpus)
 {
 	struct timeval tv;
 
-	cpu_generator_state	state;
+	cpu_generator_state state;
 
 	state.ncpus = ncpus;
 
@@ -8207,7 +8208,7 @@ cpu_generator_init(int ncpus)
 }
 
 static void
-cpu_generator_refill(cpu_generator_state *state)
+cpu_generator_refill(cpu_generator_state * state)
 {
 	struct timeval tv;
 
@@ -8223,7 +8224,7 @@ cpu_generator_refill(cpu_generator_state *state)
 }
 
 static void
-cpu_generator_reset(cpu_generator_state *state)
+cpu_generator_reset(cpu_generator_state * state)
 {
 	state->nitems = 0;
 	cpu_generator_refill(state);
@@ -8236,15 +8237,15 @@ cpu_generator_reset(cpu_generator_state *state)
 }
 
 static int
-cpu_generator_thread(cpu_generator_state *state)
+cpu_generator_thread(cpu_generator_state * state)
 {
 	if (state->nitems == 0)
 		cpu_generator_refill(state);
 
 	while (true)
 	{
-		int idx = lrand48() % state->nitems;
-		int cpu = state->items[idx];
+		int			idx = lrand48() % state->nitems;
+		int			cpu = state->items[idx];
 
 		state->items[idx] = state->items[state->nitems - 1];
 		state->nitems--;
@@ -8256,10 +8257,10 @@ cpu_generator_thread(cpu_generator_state *state)
 }
 
 static int
-cpu_generator_client(cpu_generator_state *state, int thread_cpu)
+cpu_generator_client(cpu_generator_state * state, int thread_cpu)
 {
-	int		min_clients;
-	bool	has_valid_cpus = false;
+	int			min_clients;
+	bool		has_valid_cpus = false;
 
 	for (int i = 0; i < state->nitems; i++)
 	{
@@ -8284,8 +8285,8 @@ cpu_generator_client(cpu_generator_state *state, int thread_cpu)
 
 	while (true)
 	{
-		int idx = lrand48() % state->nitems;
-		int cpu = state->items[idx];
+		int			idx = lrand48() % state->nitems;
+		int			cpu = state->items[idx];
 
 		if (cpu == thread_cpu)
 			continue;
@@ -8303,7 +8304,7 @@ cpu_generator_client(cpu_generator_state *state, int thread_cpu)
 }
 
 static void
-cpu_generator_print(cpu_generator_state *state)
+cpu_generator_print(cpu_generator_state * state)
 {
 	for (int i = 0; i < state->ncpus; i++)
 	{
@@ -8312,10 +8313,10 @@ cpu_generator_print(cpu_generator_state *state)
 }
 
 static bool
-cpu_generator_check(cpu_generator_state *state)
+cpu_generator_check(cpu_generator_state * state)
 {
-	int	min_count = INT_MAX,
-		max_count = 0;
+	int			min_count = INT_MAX,
+				max_count = 0;
 
 	for (int i = 0; i < state->ncpus; i++)
 	{
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 1bef98471c3..014a6079af2 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -178,6 +178,8 @@ extern PGDLLIMPORT int MaxConnections;
 extern PGDLLIMPORT int max_worker_processes;
 extern PGDLLIMPORT int max_parallel_workers;
 
+extern PGDLLIMPORT bool numa_buffers_interleave;
+
 extern PGDLLIMPORT int commit_timestamp_buffers;
 extern PGDLLIMPORT int multixact_member_buffers;
 extern PGDLLIMPORT int multixact_offset_buffers;
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 41fdc1e7693..c257c8a1c20 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -319,6 +319,7 @@ extern void EvictRelUnpinnedBuffers(Relation rel,
 /* in buf_init.c */
 extern void BufferManagerShmemInit(void);
 extern Size BufferManagerShmemSize(void);
+extern int	BufferGetNode(Buffer buffer);
 
 /* in localbuf.c */
 extern void AtProcExit_LocalBuffers(void);
-- 
2.49.0

From 6919b1c1c59a6084017ebae5a884bb6c60639364 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <to...@vondra.me>
Date: Thu, 22 May 2025 18:27:06 +0200
Subject: [PATCH v1 2/6] NUMA: localalloc

Set the default allocation policy to "localalloc", which means from the
local NUMA node. This is useful for process-private memory, which is not
going to be shared with other nodes, and is relatively short-lived (so
we're unlikely to have issues if the process gets moved by scheduler).

This sets default for the whole process, for all future allocations. But
that's fine, we've already populated the shared memory earlier (by
interleaving it explicitly). Otherwise we'd trigger page fault and it'd
be allocated on local node.

XXX This patch may not be necessary, as we now locate memory to nodes
using explicit numa_tonode_memory() calls, and not by interleaving. But
it's useful for experiments during development, so I'm keeping it.
---
 src/backend/utils/init/globals.c    |  1 +
 src/backend/utils/init/miscinit.c   | 16 ++++++++++++++++
 src/backend/utils/misc/guc_tables.c | 10 ++++++++++
 src/include/miscadmin.h             |  1 +
 4 files changed, 28 insertions(+)

diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 876cb64cf66..f5359db3656 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -147,6 +147,7 @@ int			MaxBackends = 0;
 
 /* NUMA stuff */
 bool		numa_buffers_interleave = false;
+bool		numa_localalloc = false;
 
 /* GUC parameters for vacuum */
 int			VacuumBufferUsageLimit = 2048;
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 43b4dbccc3d..d11936691b2 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -28,6 +28,10 @@
 #include <arpa/inet.h>
 #include <utime.h>
 
+#ifdef USE_LIBNUMA
+#include <numa.h>
+#endif
+
 #include "access/htup_details.h"
 #include "access/parallel.h"
 #include "catalog/pg_authid.h"
@@ -164,6 +168,18 @@ InitPostmasterChild(void)
 				(errcode_for_socket_access(),
 				 errmsg_internal("could not set postmaster death monitoring pipe to FD_CLOEXEC mode: %m")));
 #endif
+
+#ifdef USE_LIBNUMA
+	/*
+	 * Set the default allocation policy to local node, where the task is
+	 * executing at the time of a page fault.
+	 *
+	 * XXX I believe this is not necessary, now that we don't use automatic
+	 * interleaving (numa_set_interleave_mask).
+	 */
+	if (numa_localalloc)
+		numa_set_localalloc();
+#endif
 }
 
 /*
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 198a57e70a5..57f2df7ab74 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2126,6 +2126,16 @@ struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"numa_localalloc", PGC_POSTMASTER, DEVELOPER_OPTIONS,
+			gettext_noop("Enables setting the default allocation policy to local node."),
+			gettext_noop("When enabled, allocate from the node where the task is executing."),
+		},
+		&numa_localalloc,
+		false,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"sync_replication_slots", PGC_SIGHUP, REPLICATION_STANDBY,
 			gettext_noop("Enables a physical standby to synchronize logical failover replication slots from the primary server."),
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 014a6079af2..692871a401f 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -179,6 +179,7 @@ extern PGDLLIMPORT int max_worker_processes;
 extern PGDLLIMPORT int max_parallel_workers;
 
 extern PGDLLIMPORT bool numa_buffers_interleave;
+extern PGDLLIMPORT bool numa_localalloc;
 
 extern PGDLLIMPORT int commit_timestamp_buffers;
 extern PGDLLIMPORT int multixact_member_buffers;
-- 
2.49.0

From c2b2edb71d629ebe4283b636f058b8e42d1f1a35 Mon Sep 17 00:00:00 2001
From: Andres Freund <and...@anarazel.de>
Date: Mon, 14 Oct 2024 14:10:13 -0400
Subject: [PATCH v1 3/6] freelist: Don't track tail of a freelist

The freelist tail isn't currently used, making it unnecessary overhead.
So just don't do that.
---
 src/backend/storage/buffer/freelist.c | 9 ---------
 1 file changed, 9 deletions(-)

diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 01909be0272..e046526c149 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -40,12 +40,6 @@ typedef struct
 	pg_atomic_uint32 nextVictimBuffer;
 
 	int			firstFreeBuffer;	/* Head of list of unused buffers */
-	int			lastFreeBuffer; /* Tail of list of unused buffers */
-
-	/*
-	 * NOTE: lastFreeBuffer is undefined when firstFreeBuffer is -1 (that is,
-	 * when the list is empty)
-	 */
 
 	/*
 	 * Statistics.  These counters should be wide enough that they can't
@@ -371,8 +365,6 @@ StrategyFreeBuffer(BufferDesc *buf)
 	if (buf->freeNext == FREENEXT_NOT_IN_LIST)
 	{
 		buf->freeNext = StrategyControl->firstFreeBuffer;
-		if (buf->freeNext < 0)
-			StrategyControl->lastFreeBuffer = buf->buf_id;
 		StrategyControl->firstFreeBuffer = buf->buf_id;
 	}
 
@@ -509,7 +501,6 @@ StrategyInitialize(bool init)
 		 * assume it was previously set up by BufferManagerShmemInit().
 		 */
 		StrategyControl->firstFreeBuffer = 0;
-		StrategyControl->lastFreeBuffer = NBuffers - 1;
 
 		/* Initialize the clock sweep pointer */
 		pg_atomic_init_u32(&StrategyControl->nextVictimBuffer, 0);
-- 
2.49.0

From 6505848ac8359c8c76dfbffc7150b6601ab07601 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <to...@vondra.me>
Date: Thu, 22 May 2025 18:38:41 +0200
Subject: [PATCH v1 4/6] NUMA: partition buffer freelist

Instead of a single buffer freelist, partition into multiple smaller
lists, to reduce lock contention, and to spread the buffers over all
NUMA nodes more evenly.

There are four strategies, specified by GUC numa_partition_freelist

* none - single long freelist, should work just like now

* node - one freelist per NUMA node, with only buffers from that node

* cpu - one freelist per CPU

* pid - freelist determined by PID (same number of freelists as 'cpu')

When allocating a buffer, it's taken from the correct freelist (e.g.
same NUMA node).

Note: This is (probably) more important than partitioning ProcArray.
---
 src/backend/storage/buffer/buf_init.c |   4 +-
 src/backend/storage/buffer/freelist.c | 324 +++++++++++++++++++++++---
 src/backend/utils/init/globals.c      |   1 +
 src/backend/utils/misc/guc_tables.c   |  18 ++
 src/include/miscadmin.h               |   1 +
 src/include/storage/bufmgr.h          |   8 +
 6 files changed, 327 insertions(+), 29 deletions(-)

diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index 2ad34624c49..920f1a32a8f 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -543,8 +543,8 @@ pg_numa_interleave_memory(char *startptr, char *endptr,
 		 * XXX no return value, to make this fail on error, has to use
 		 * numa_set_strict
 		 *
-		 * XXX Should we still touch the memory first, like with numa_move_pages,
-		 * or is that not necessary?
+		 * XXX Should we still touch the memory first, like with
+		 * numa_move_pages, or is that not necessary?
 		 */
 		numa_tonode_memory(ptr, sz, node);
 
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index e046526c149..c93ec2841c5 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -15,14 +15,41 @@
  */
 #include "postgres.h"
 
+#include <sched.h>
+#include <sys/sysinfo.h>
+
+#ifdef USE_LIBNUMA
+#include <numa.h>
+#include <numaif.h>
+#endif
+
 #include "pgstat.h"
 #include "port/atomics.h"
 #include "storage/buf_internals.h"
 #include "storage/bufmgr.h"
+#include "storage/ipc.h"
 #include "storage/proc.h"
 
 #define INT_ACCESS_ONCE(var)	((int)(*((volatile int *)&(var))))
 
+/*
+ * Represents one freelist partition.
+ */
+typedef struct BufferStrategyFreelist
+{
+	/* Spinlock: protects the values below */
+	slock_t		freelist_lock;
+
+	/*
+	 * XXX Not sure why this needs to be aligned like this. Need to ask
+	 * Andres.
+	 */
+	int			firstFreeBuffer __attribute__((aligned(64)));	/* Head of list of
+																 * unused buffers */
+
+	/* Number of buffers consumed from this list. */
+	uint64		consumed;
+}			BufferStrategyFreelist;
 
 /*
  * The shared freelist control information.
@@ -39,8 +66,6 @@ typedef struct
 	 */
 	pg_atomic_uint32 nextVictimBuffer;
 
-	int			firstFreeBuffer;	/* Head of list of unused buffers */
-
 	/*
 	 * Statistics.  These counters should be wide enough that they can't
 	 * overflow during a single bgwriter cycle.
@@ -51,13 +76,27 @@ typedef struct
 	/*
 	 * Bgworker process to be notified upon activity or -1 if none. See
 	 * StrategyNotifyBgWriter.
+	 *
+	 * XXX Not sure why this needs to be aligned like this. Need to ask
+	 * Andres. Also, shouldn't the alignment be specified after, like for
+	 * "consumed"?
 	 */
-	int			bgwprocno;
+	int			__attribute__((aligned(64))) bgwprocno;
+
+	BufferStrategyFreelist freelists[FLEXIBLE_ARRAY_MEMBER];
 } BufferStrategyControl;
 
 /* Pointers to shared state */
 static BufferStrategyControl *StrategyControl = NULL;
 
+/*
+ * XXX shouldn't this be in BufferStrategyControl? Probably not, we need to
+ * calculate it during sizing, and perhaps it could change before the memory
+ * gets allocated (so we need to remember the values).
+ */
+static int	strategy_nnodes;
+static int	strategy_ncpus;
+
 /*
  * Private (non-shared) state for managing a ring of shared buffers to re-use.
  * This is currently the only kind of BufferAccessStrategy object, but someday
@@ -157,6 +196,90 @@ ClockSweepTick(void)
 	return victim;
 }
 
+/*
+ * ChooseFreeList
+ *		Pick the buffer freelist to use, depending on the CPU and NUMA node.
+ *
+ * Without partitioned freelists (numa_partition_freelist=false), there's only
+ * a single freelist, so use that.
+ *
+ * With partitioned freelists, we have multiple ways how to pick the freelist
+ * for the backend:
+ *
+ * - one freelist per CPU, use the freelist for CPU the task executes on
+ *
+ * - one freelist per NUMA node, use the freelist for node task executes on
+ *
+ * - use fixed number of freelists, map processes to lists based on PID
+ *
+ * There may be some other strategies, not sure. The important thing is this
+ * needs to be refrecled during initialization, i.e. we need to create the
+ * right number of lists.
+ */
+static BufferStrategyFreelist *
+ChooseFreeList(void)
+{
+	unsigned	cpu;
+	unsigned	node;
+	int			rc;
+
+	int			freelist_idx;
+
+	/* freelist not partitioned, return the first (and only) freelist */
+	if (numa_partition_freelist == FREELIST_PARTITION_NONE)
+		return &StrategyControl->freelists[0];
+
+	/*
+	 * freelist is partitioned, so determine the CPU/NUMA node, and pick a
+	 * list based on that.
+	 */
+	rc = getcpu(&cpu, &node);
+	if (rc != 0)
+		elog(ERROR, "getcpu failed: %m");
+
+	/*
+	 * FIXME This doesn't work well if CPUs are excluded from being run or
+	 * offline. In that case we end up not using some freelists at all, but
+	 * not sure if we need to worry about that. Probably not for now. But
+	 * could that change while the system is running?
+	 *
+	 * XXX Maybe we should somehow detect changes to the list of CPUs, and
+	 * rebuild the lists if that changes? But that seems expensive.
+	 */
+	if (cpu > strategy_ncpus)
+		elog(ERROR, "cpu out of range: %d > %u", cpu, strategy_ncpus);
+	else if (node > strategy_nnodes)
+		elog(ERROR, "node out of range: %d > %u", cpu, strategy_nnodes);
+
+	/*
+	 * Pick the freelist, based on CPU, NUMA node or process PID. This matches
+	 * how we built the freelists above.
+	 *
+	 * XXX Can we rely on some of the values (especially strategy_nnodes) to
+	 * be a power-of-2? Then we could replace the modulo with a mask, which is
+	 * likely more efficient.
+	 */
+	switch (numa_partition_freelist)
+	{
+		case FREELIST_PARTITION_CPU:
+			freelist_idx = cpu % strategy_ncpus;
+			break;
+
+		case FREELIST_PARTITION_NODE:
+			freelist_idx = node % strategy_nnodes;
+			break;
+
+		case FREELIST_PARTITION_PID:
+			freelist_idx = MyProcPid % strategy_ncpus;
+			break;
+
+		default:
+			elog(ERROR, "unknown freelist partitioning value");
+	}
+
+	return &StrategyControl->freelists[freelist_idx];
+}
+
 /*
  * have_free_buffer -- a lockless check to see if there is a free buffer in
  *					   buffer pool.
@@ -168,10 +291,13 @@ ClockSweepTick(void)
 bool
 have_free_buffer(void)
 {
-	if (StrategyControl->firstFreeBuffer >= 0)
-		return true;
-	else
-		return false;
+	for (int i = 0; i < strategy_ncpus; i++)
+	{
+		if (StrategyControl->freelists[i].firstFreeBuffer >= 0)
+			return true;
+	}
+
+	return false;
 }
 
 /*
@@ -193,6 +319,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
 	int			bgwprocno;
 	int			trycounter;
 	uint32		local_buf_state;	/* to avoid repeated (de-)referencing */
+	BufferStrategyFreelist *freelist;
 
 	*from_ring = false;
 
@@ -259,31 +386,35 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
 	 * buffer_strategy_lock not the individual buffer spinlocks, so it's OK to
 	 * manipulate them without holding the spinlock.
 	 */
-	if (StrategyControl->firstFreeBuffer >= 0)
+	freelist = ChooseFreeList();
+	if (freelist->firstFreeBuffer >= 0)
 	{
 		while (true)
 		{
 			/* Acquire the spinlock to remove element from the freelist */
-			SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
+			SpinLockAcquire(&freelist->freelist_lock);
 
-			if (StrategyControl->firstFreeBuffer < 0)
+			if (freelist->firstFreeBuffer < 0)
 			{
-				SpinLockRelease(&StrategyControl->buffer_strategy_lock);
+				SpinLockRelease(&freelist->freelist_lock);
 				break;
 			}
 
-			buf = GetBufferDescriptor(StrategyControl->firstFreeBuffer);
+			buf = GetBufferDescriptor(freelist->firstFreeBuffer);
 			Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
 
 			/* Unconditionally remove buffer from freelist */
-			StrategyControl->firstFreeBuffer = buf->freeNext;
+			freelist->firstFreeBuffer = buf->freeNext;
 			buf->freeNext = FREENEXT_NOT_IN_LIST;
 
+			/* increment number of buffers we consumed from this list */
+			freelist->consumed++;
+
 			/*
 			 * Release the lock so someone else can access the freelist while
 			 * we check out this buffer.
 			 */
-			SpinLockRelease(&StrategyControl->buffer_strategy_lock);
+			SpinLockRelease(&freelist->freelist_lock);
 
 			/*
 			 * If the buffer is pinned or has a nonzero usage_count, we cannot
@@ -305,7 +436,17 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
 		}
 	}
 
-	/* Nothing on the freelist, so run the "clock sweep" algorithm */
+	/*
+	 * Nothing on the freelist, so run the "clock sweep" algorithm
+	 *
+	 * XXX Should we also make this NUMA-aware, to only access buffers from
+	 * the same NUMA node? That'd probably mean we need to make the clock
+	 * sweep NUMA-aware, perhaps by having multiple clock sweeps, each for a
+	 * subset of buffers. But that also means each process could "sweep" only
+	 * a fraction of buffers, even if the other buffers are better candidates
+	 * for eviction. Would that also mean we'd have multiple bgwriters, one
+	 * for each node, or would one bgwriter handle all of that?
+	 */
 	trycounter = NBuffers;
 	for (;;)
 	{
@@ -352,11 +493,22 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
 
 /*
  * StrategyFreeBuffer: put a buffer on the freelist
+ *
+ * XXX This calls ChooseFreeList() again, and it might return the freelist to
+ * a different freelist than it was taken from (either by a different backend,
+ * or perhaps even the same backend running on a different CPU). Is that good?
+ * Maybe we should try to balance this somehow, e.g. by choosing a random list,
+ * the shortest one, or something like that? But that breaks the whole idea of
+ * having freelists with buffers from a particular NUMA node.
  */
 void
 StrategyFreeBuffer(BufferDesc *buf)
 {
-	SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
+	BufferStrategyFreelist *freelist;
+
+	freelist = ChooseFreeList();
+
+	SpinLockAcquire(&freelist->freelist_lock);
 
 	/*
 	 * It is possible that we are told to put something in the freelist that
@@ -364,11 +516,11 @@ StrategyFreeBuffer(BufferDesc *buf)
 	 */
 	if (buf->freeNext == FREENEXT_NOT_IN_LIST)
 	{
-		buf->freeNext = StrategyControl->firstFreeBuffer;
-		StrategyControl->firstFreeBuffer = buf->buf_id;
+		buf->freeNext = freelist->firstFreeBuffer;
+		freelist->firstFreeBuffer = buf->buf_id;
 	}
 
-	SpinLockRelease(&StrategyControl->buffer_strategy_lock);
+	SpinLockRelease(&freelist->freelist_lock);
 }
 
 /*
@@ -432,6 +584,42 @@ StrategyNotifyBgWriter(int bgwprocno)
 	SpinLockRelease(&StrategyControl->buffer_strategy_lock);
 }
 
+/* prints some debug info / stats about freelists at shutdown */
+static void
+freelist_before_shmem_exit(int code, Datum arg)
+{
+	for (int node = 0; node < strategy_ncpus; node++)
+	{
+		BufferStrategyFreelist *freelist = &StrategyControl->freelists[node];
+		uint64		remain = 0;
+		uint64		actually_free = 0;
+		int			cur = freelist->firstFreeBuffer;
+
+		while (cur >= 0)
+		{
+			uint32		local_buf_state;
+			BufferDesc *buf;
+
+			buf = GetBufferDescriptor(cur);
+
+			remain++;
+
+			local_buf_state = LockBufHdr(buf);
+
+			if (!(local_buf_state & BM_TAG_VALID))
+				actually_free++;
+
+			UnlockBufHdr(buf, local_buf_state);
+
+			cur = buf->freeNext;
+		}
+		elog(LOG, "freelist %d, firstF: %d: consumed: %lu, remain: %lu, actually free: %lu",
+			 node,
+			 freelist->firstFreeBuffer,
+			 freelist->consumed,
+			 remain, actually_free);
+	}
+}
 
 /*
  * StrategyShmemSize
@@ -446,11 +634,33 @@ StrategyShmemSize(void)
 {
 	Size		size = 0;
 
+	/* FIXME */
+#ifdef USE_LIBNUMA
+	strategy_ncpus = numa_num_task_cpus();
+	strategy_nnodes = numa_num_task_nodes();
+#else
+	strategy_ncpus = 1;
+	strategy_nnodes = 1;
+#endif
+
+	Assert(strategy_nnodes <= strategy_ncpus);
+
 	/* size of lookup hash table ... see comment in StrategyInitialize */
 	size = add_size(size, BufTableShmemSize(NBuffers + NUM_BUFFER_PARTITIONS));
 
 	/* size of the shared replacement strategy control block */
-	size = add_size(size, MAXALIGN(sizeof(BufferStrategyControl)));
+	size = add_size(size, MAXALIGN(offsetof(BufferStrategyControl, freelists)));
+
+	/*
+	 * Allocate one frelist per CPU. We might use per-node freelists, but the
+	 * assumption is the number of CPUs is less than number of NUMA nodes.
+	 *
+	 * FIXME This assumes the we have more CPUs than NUMA nodes, which seems
+	 * like a safe assumption. But maybe we should calculate how many elements
+	 * we actually need, depending on the GUC? Not a huge amount of memory.
+	 */
+	size = add_size(size, MAXALIGN(mul_size(sizeof(BufferStrategyFreelist),
+											strategy_ncpus)));
 
 	return size;
 }
@@ -466,6 +676,7 @@ void
 StrategyInitialize(bool init)
 {
 	bool		found;
+	int			buffers_per_cpu;
 
 	/*
 	 * Initialize the shared buffer lookup hashtable.
@@ -484,23 +695,27 @@ StrategyInitialize(bool init)
 	 */
 	StrategyControl = (BufferStrategyControl *)
 		ShmemInitStruct("Buffer Strategy Status",
-						sizeof(BufferStrategyControl),
+						offsetof(BufferStrategyControl, freelists) +
+						(sizeof(BufferStrategyFreelist) * strategy_ncpus),
 						&found);
 
 	if (!found)
 	{
+		/*
+		 * XXX Calling get_nprocs() may not be quite correct, because some of
+		 * the processors may get disabled, etc.
+		 */
+		int			num_cpus = get_nprocs();
+
 		/*
 		 * Only done once, usually in postmaster
 		 */
 		Assert(init);
 
-		SpinLockInit(&StrategyControl->buffer_strategy_lock);
+		/* register callback to dump some stats on exit */
+		before_shmem_exit(freelist_before_shmem_exit, 0);
 
-		/*
-		 * Grab the whole linked list of free buffers for our strategy. We
-		 * assume it was previously set up by BufferManagerShmemInit().
-		 */
-		StrategyControl->firstFreeBuffer = 0;
+		SpinLockInit(&StrategyControl->buffer_strategy_lock);
 
 		/* Initialize the clock sweep pointer */
 		pg_atomic_init_u32(&StrategyControl->nextVictimBuffer, 0);
@@ -511,6 +726,61 @@ StrategyInitialize(bool init)
 
 		/* No pending notification */
 		StrategyControl->bgwprocno = -1;
+
+		/*
+		 * Rebuild the freelist - right now all buffers are in one huge list,
+		 * we want to rework that into multiple lists. Start by initializing
+		 * the strategy to have empty lists.
+		 */
+		for (int nfreelist = 0; nfreelist < strategy_ncpus; nfreelist++)
+		{
+			BufferStrategyFreelist *freelist;
+
+			freelist = &StrategyControl->freelists[nfreelist];
+
+			freelist->firstFreeBuffer = FREENEXT_END_OF_LIST;
+
+			SpinLockInit(&freelist->freelist_lock);
+		}
+
+		/* buffers per CPU (also used for PID partitioning) */
+		buffers_per_cpu = (NBuffers / strategy_ncpus);
+
+		elog(LOG, "NBuffers: %d, nodes %d, ncpus: %d, divide: %d, remain: %d",
+			 NBuffers, strategy_nnodes, strategy_ncpus,
+			 buffers_per_cpu, NBuffers - (strategy_ncpus * buffers_per_cpu));
+
+		/*
+		 * Walk through the buffers, add them to the correct list. Walk from
+		 * the end, because we're adding the buffers to the beginning.
+		 */
+		for (int i = NBuffers - 1; i >= 0; i--)
+		{
+			BufferDesc *buf = GetBufferDescriptor(i);
+			BufferStrategyFreelist *freelist;
+			int			belongs_to = 0; /* first freelist by default */
+
+			/*
+			 * Split the freelist into partitions, if needed (or just keep the
+			 * freelist we already built in BufferManagerShmemInit().
+			 */
+			if ((numa_partition_freelist == FREELIST_PARTITION_CPU) ||
+				(numa_partition_freelist == FREELIST_PARTITION_PID))
+			{
+				belongs_to = (i % num_cpus);
+			}
+			else if (numa_partition_freelist == FREELIST_PARTITION_NODE)
+			{
+				/* determine NUMA node for buffer */
+				belongs_to = BufferGetNode(i);
+			}
+
+			/* add to the right freelist */
+			freelist = &StrategyControl->freelists[belongs_to];
+
+			buf->freeNext = freelist->firstFreeBuffer;
+			freelist->firstFreeBuffer = i;
+		}
 	}
 	else
 		Assert(!init);
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index f5359db3656..7febf3001a3 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -148,6 +148,7 @@ int			MaxBackends = 0;
 /* NUMA stuff */
 bool		numa_buffers_interleave = false;
 bool		numa_localalloc = false;
+int			numa_partition_freelist = 0;
 
 /* GUC parameters for vacuum */
 int			VacuumBufferUsageLimit = 2048;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 57f2df7ab74..e2361c161e6 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -491,6 +491,14 @@ static const struct config_enum_entry file_copy_method_options[] = {
 	{NULL, 0, false}
 };
 
+static const struct config_enum_entry freelist_partition_options[] = {
+	{"none", FREELIST_PARTITION_NONE, false},
+	{"node", FREELIST_PARTITION_NODE, false},
+	{"cpu", FREELIST_PARTITION_CPU, false},
+	{"pid", FREELIST_PARTITION_PID, false},
+	{NULL, 0, false}
+};
+
 /*
  * Options for enum values stored in other modules
  */
@@ -5284,6 +5292,16 @@ struct config_enum ConfigureNamesEnum[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"numa_partition_freelist", PGC_USERSET, DEVELOPER_OPTIONS,
+			gettext_noop("Enables buffer freelists to be partitioned per NUMA node."),
+			gettext_noop("When enabled, we create a separate freelist per NUMA node."),
+		},
+		&numa_partition_freelist,
+		FREELIST_PARTITION_NONE, freelist_partition_options,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"wal_sync_method", PGC_SIGHUP, WAL_SETTINGS,
 			gettext_noop("Selects the method used for forcing WAL updates to disk."),
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 692871a401f..17528439f07 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -180,6 +180,7 @@ extern PGDLLIMPORT int max_parallel_workers;
 
 extern PGDLLIMPORT bool numa_buffers_interleave;
 extern PGDLLIMPORT bool numa_localalloc;
+extern PGDLLIMPORT int numa_partition_freelist;
 
 extern PGDLLIMPORT int commit_timestamp_buffers;
 extern PGDLLIMPORT int multixact_member_buffers;
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index c257c8a1c20..efb7e28c10f 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -93,6 +93,14 @@ typedef enum ExtendBufferedFlags
 	EB_LOCK_TARGET = (1 << 5),
 }			ExtendBufferedFlags;
 
+typedef enum FreelistPartitionMode
+{
+	FREELIST_PARTITION_NONE,
+	FREELIST_PARTITION_NODE,
+	FREELIST_PARTITION_CPU,
+	FREELIST_PARTITION_PID,
+}			FreelistPartitionMode;
+
 /*
  * Some functions identify relations either by relation or smgr +
  * relpersistence.  Used via the BMR_REL()/BMR_SMGR() macros below.  This
-- 
2.49.0

From 05c594ed8eb8a266a74038c3131d12bb03d897e3 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <to...@vondra.me>
Date: Thu, 22 May 2025 18:39:08 +0200
Subject: [PATCH v1 5/6] NUMA: interleave PGPROC entries

The goal is to distribute ProcArray (or rather PGPROC entries and
associated fast-path arrays) to NUMA nodes.

We can't do this by simply interleaving pages, because that wouldn't
work for both parts at the same time. We want to place the PGPROC and
it's fast-path locking structs on the same node, but the structs are
of different sizes, etc.

Another problem is that PGPROC entries are fairly small, so with huge
pages and reasonable values of max_connections everything fits onto a
single page. We don't want to make this incompatible with huge pages.

Note: If we eventually switch to allocating separate shared segments for
different parts (to allow on-line resizing), we could keep using regular
pages for procarray, and this would not be such an issue.

To make this work, we split the PGPROC array into per-node segments,
each with about (MaxBackends / numa_nodes) entries, and one segment for
auxiliary processes and prepared transations. And we do the same thing
for fast-path arrays.

The PGPROC segments are laid out like this (e.g. for 2 NUMA nodes):

 - PGPROC array / node #0
 - PGPROC array / node #1
 - PGPROC array / aux processes + 2PC transactions
 - fast-path arrays / node #0
 - fast-path arrays / node #1
 - fast-path arrays / aux processes + 2PC transaction

Each segment is aligned to (starts at) memory page, and is effectively a
multiple of multiple memory pages.

Having a single PGPROC array made certain operations easiers - e.g. it
was possible to iterate the array, and GetNumberFromPGProc() could
calculate offset by simply subtracting PGPROC pointers. With multiple
segments that's not possible, but the fallout is minimal.

Most places accessed PGPROC through PROC_HDR->allProcs, and can continue
to do so, except that now they get a pointer to the PGPROC (which most
places wanted anyway).

Note: There's an indirection, though. But the pointer does not change,
so hopefully that's not an issue. And each PGPROC entry gets an explicit
procnumber field, which is the index in allProcs, GetNumberFromPGProc
can simply return that.

Each PGPROC also gets numa_node, tracking the NUMA node, so that we
don't have to recalculate that. This is used by InitProcess() to pick
a PGPROC entry from the local NUMA node.

Note: The scheduler may migrate the process to a different CPU/node
later. Maybe we should consider pinning the process to the node?
---
 src/backend/access/transam/clog.c      |   4 +-
 src/backend/postmaster/pgarch.c        |   2 +-
 src/backend/postmaster/walsummarizer.c |   2 +-
 src/backend/storage/buffer/freelist.c  |   2 +-
 src/backend/storage/ipc/procarray.c    |  62 ++---
 src/backend/storage/lmgr/lock.c        |   6 +-
 src/backend/storage/lmgr/proc.c        | 368 +++++++++++++++++++++++--
 src/backend/utils/init/globals.c       |   1 +
 src/backend/utils/misc/guc_tables.c    |  10 +
 src/include/miscadmin.h                |   1 +
 src/include/storage/proc.h             |  11 +-
 11 files changed, 407 insertions(+), 62 deletions(-)

diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 48f10bec91e..90ddff37bc6 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -576,7 +576,7 @@ TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
 	/* Walk the list and update the status of all XIDs. */
 	while (nextidx != INVALID_PROC_NUMBER)
 	{
-		PGPROC	   *nextproc = &ProcGlobal->allProcs[nextidx];
+		PGPROC	   *nextproc = ProcGlobal->allProcs[nextidx];
 		int64		thispageno = nextproc->clogGroupMemberPage;
 
 		/*
@@ -635,7 +635,7 @@ TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
 	 */
 	while (wakeidx != INVALID_PROC_NUMBER)
 	{
-		PGPROC	   *wakeproc = &ProcGlobal->allProcs[wakeidx];
+		PGPROC	   *wakeproc = ProcGlobal->allProcs[wakeidx];
 
 		wakeidx = pg_atomic_read_u32(&wakeproc->clogGroupNext);
 		pg_atomic_write_u32(&wakeproc->clogGroupNext, INVALID_PROC_NUMBER);
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 7e622ae4bd2..75c0e4bf53c 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -289,7 +289,7 @@ PgArchWakeup(void)
 	 * be relaunched shortly and will start archiving.
 	 */
 	if (arch_pgprocno != INVALID_PROC_NUMBER)
-		SetLatch(&ProcGlobal->allProcs[arch_pgprocno].procLatch);
+		SetLatch(&ProcGlobal->allProcs[arch_pgprocno]->procLatch);
 }
 
 
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index 0fec4f1f871..0044ef54363 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -649,7 +649,7 @@ WakeupWalSummarizer(void)
 	LWLockRelease(WALSummarizerLock);
 
 	if (pgprocno != INVALID_PROC_NUMBER)
-		SetLatch(&ProcGlobal->allProcs[pgprocno].procLatch);
+		SetLatch(&ProcGlobal->allProcs[pgprocno]->procLatch);
 }
 
 /*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index c93ec2841c5..4e390a77a71 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -360,7 +360,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
 		 * actually fine because procLatch isn't ever freed, so we just can
 		 * potentially set the wrong process' (or no process') latch.
 		 */
-		SetLatch(&ProcGlobal->allProcs[bgwprocno].procLatch);
+		SetLatch(&ProcGlobal->allProcs[bgwprocno]->procLatch);
 	}
 
 	/*
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index e5b945a9ee3..3277480fbcf 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -268,7 +268,7 @@ typedef enum KAXCompressReason
 
 static ProcArrayStruct *procArray;
 
-static PGPROC *allProcs;
+static PGPROC **allProcs;
 
 /*
  * Cache to reduce overhead of repeated calls to TransactionIdIsInProgress()
@@ -502,7 +502,7 @@ ProcArrayAdd(PGPROC *proc)
 		int			this_procno = arrayP->pgprocnos[index];
 
 		Assert(this_procno >= 0 && this_procno < (arrayP->maxProcs + NUM_AUXILIARY_PROCS));
-		Assert(allProcs[this_procno].pgxactoff == index);
+		Assert(allProcs[this_procno]->pgxactoff == index);
 
 		/* If we have found our right position in the array, break */
 		if (this_procno > pgprocno)
@@ -538,9 +538,9 @@ ProcArrayAdd(PGPROC *proc)
 		int			procno = arrayP->pgprocnos[index];
 
 		Assert(procno >= 0 && procno < (arrayP->maxProcs + NUM_AUXILIARY_PROCS));
-		Assert(allProcs[procno].pgxactoff == index - 1);
+		Assert(allProcs[procno]->pgxactoff == index - 1);
 
-		allProcs[procno].pgxactoff = index;
+		allProcs[procno]->pgxactoff = index;
 	}
 
 	/*
@@ -581,7 +581,7 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
 	myoff = proc->pgxactoff;
 
 	Assert(myoff >= 0 && myoff < arrayP->numProcs);
-	Assert(ProcGlobal->allProcs[arrayP->pgprocnos[myoff]].pgxactoff == myoff);
+	Assert(ProcGlobal->allProcs[arrayP->pgprocnos[myoff]]->pgxactoff == myoff);
 
 	if (TransactionIdIsValid(latestXid))
 	{
@@ -636,9 +636,9 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
 		int			procno = arrayP->pgprocnos[index];
 
 		Assert(procno >= 0 && procno < (arrayP->maxProcs + NUM_AUXILIARY_PROCS));
-		Assert(allProcs[procno].pgxactoff - 1 == index);
+		Assert(allProcs[procno]->pgxactoff - 1 == index);
 
-		allProcs[procno].pgxactoff = index;
+		allProcs[procno]->pgxactoff = index;
 	}
 
 	/*
@@ -860,7 +860,7 @@ ProcArrayGroupClearXid(PGPROC *proc, TransactionId latestXid)
 	/* Walk the list and clear all XIDs. */
 	while (nextidx != INVALID_PROC_NUMBER)
 	{
-		PGPROC	   *nextproc = &allProcs[nextidx];
+		PGPROC	   *nextproc = allProcs[nextidx];
 
 		ProcArrayEndTransactionInternal(nextproc, nextproc->procArrayGroupMemberXid);
 
@@ -880,7 +880,7 @@ ProcArrayGroupClearXid(PGPROC *proc, TransactionId latestXid)
 	 */
 	while (wakeidx != INVALID_PROC_NUMBER)
 	{
-		PGPROC	   *nextproc = &allProcs[wakeidx];
+		PGPROC	   *nextproc = allProcs[wakeidx];
 
 		wakeidx = pg_atomic_read_u32(&nextproc->procArrayGroupNext);
 		pg_atomic_write_u32(&nextproc->procArrayGroupNext, INVALID_PROC_NUMBER);
@@ -1526,7 +1526,7 @@ TransactionIdIsInProgress(TransactionId xid)
 		pxids = other_subxidstates[pgxactoff].count;
 		pg_read_barrier();		/* pairs with barrier in GetNewTransactionId() */
 		pgprocno = arrayP->pgprocnos[pgxactoff];
-		proc = &allProcs[pgprocno];
+		proc = allProcs[pgprocno];
 		for (j = pxids - 1; j >= 0; j--)
 		{
 			/* Fetch xid just once - see GetNewTransactionId */
@@ -1650,7 +1650,7 @@ TransactionIdIsActive(TransactionId xid)
 	for (i = 0; i < arrayP->numProcs; i++)
 	{
 		int			pgprocno = arrayP->pgprocnos[i];
-		PGPROC	   *proc = &allProcs[pgprocno];
+		PGPROC	   *proc = allProcs[pgprocno];
 		TransactionId pxid;
 
 		/* Fetch xid just once - see GetNewTransactionId */
@@ -1792,7 +1792,7 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 	for (int index = 0; index < arrayP->numProcs; index++)
 	{
 		int			pgprocno = arrayP->pgprocnos[index];
-		PGPROC	   *proc = &allProcs[pgprocno];
+		PGPROC	   *proc = allProcs[pgprocno];
 		int8		statusFlags = ProcGlobal->statusFlags[index];
 		TransactionId xid;
 		TransactionId xmin;
@@ -2276,7 +2276,7 @@ GetSnapshotData(Snapshot snapshot)
 			TransactionId xid = UINT32_ACCESS_ONCE(other_xids[pgxactoff]);
 			uint8		statusFlags;
 
-			Assert(allProcs[arrayP->pgprocnos[pgxactoff]].pgxactoff == pgxactoff);
+			Assert(allProcs[arrayP->pgprocnos[pgxactoff]]->pgxactoff == pgxactoff);
 
 			/*
 			 * If the transaction has no XID assigned, we can skip it; it
@@ -2350,7 +2350,7 @@ GetSnapshotData(Snapshot snapshot)
 					if (nsubxids > 0)
 					{
 						int			pgprocno = pgprocnos[pgxactoff];
-						PGPROC	   *proc = &allProcs[pgprocno];
+						PGPROC	   *proc = allProcs[pgprocno];
 
 						pg_read_barrier();	/* pairs with GetNewTransactionId */
 
@@ -2551,7 +2551,7 @@ ProcArrayInstallImportedXmin(TransactionId xmin,
 	for (index = 0; index < arrayP->numProcs; index++)
 	{
 		int			pgprocno = arrayP->pgprocnos[index];
-		PGPROC	   *proc = &allProcs[pgprocno];
+		PGPROC	   *proc = allProcs[pgprocno];
 		int			statusFlags = ProcGlobal->statusFlags[index];
 		TransactionId xid;
 
@@ -2777,7 +2777,7 @@ GetRunningTransactionData(void)
 		if (TransactionIdPrecedes(xid, oldestDatabaseRunningXid))
 		{
 			int			pgprocno = arrayP->pgprocnos[index];
-			PGPROC	   *proc = &allProcs[pgprocno];
+			PGPROC	   *proc = allProcs[pgprocno];
 
 			if (proc->databaseId == MyDatabaseId)
 				oldestDatabaseRunningXid = xid;
@@ -2808,7 +2808,7 @@ GetRunningTransactionData(void)
 		for (index = 0; index < arrayP->numProcs; index++)
 		{
 			int			pgprocno = arrayP->pgprocnos[index];
-			PGPROC	   *proc = &allProcs[pgprocno];
+			PGPROC	   *proc = allProcs[pgprocno];
 			int			nsubxids;
 
 			/*
@@ -3058,7 +3058,7 @@ GetVirtualXIDsDelayingChkpt(int *nvxids, int type)
 	for (index = 0; index < arrayP->numProcs; index++)
 	{
 		int			pgprocno = arrayP->pgprocnos[index];
-		PGPROC	   *proc = &allProcs[pgprocno];
+		PGPROC	   *proc = allProcs[pgprocno];
 
 		if ((proc->delayChkptFlags & type) != 0)
 		{
@@ -3099,7 +3099,7 @@ HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids, int type)
 	for (index = 0; index < arrayP->numProcs; index++)
 	{
 		int			pgprocno = arrayP->pgprocnos[index];
-		PGPROC	   *proc = &allProcs[pgprocno];
+		PGPROC	   *proc = allProcs[pgprocno];
 		VirtualTransactionId vxid;
 
 		GET_VXID_FROM_PGPROC(vxid, *proc);
@@ -3227,7 +3227,7 @@ BackendPidGetProcWithLock(int pid)
 
 	for (index = 0; index < arrayP->numProcs; index++)
 	{
-		PGPROC	   *proc = &allProcs[arrayP->pgprocnos[index]];
+		PGPROC	   *proc = allProcs[arrayP->pgprocnos[index]];
 
 		if (proc->pid == pid)
 		{
@@ -3270,7 +3270,7 @@ BackendXidGetPid(TransactionId xid)
 		if (other_xids[index] == xid)
 		{
 			int			pgprocno = arrayP->pgprocnos[index];
-			PGPROC	   *proc = &allProcs[pgprocno];
+			PGPROC	   *proc = allProcs[pgprocno];
 
 			result = proc->pid;
 			break;
@@ -3339,7 +3339,7 @@ GetCurrentVirtualXIDs(TransactionId limitXmin, bool excludeXmin0,
 	for (index = 0; index < arrayP->numProcs; index++)
 	{
 		int			pgprocno = arrayP->pgprocnos[index];
-		PGPROC	   *proc = &allProcs[pgprocno];
+		PGPROC	   *proc = allProcs[pgprocno];
 		uint8		statusFlags = ProcGlobal->statusFlags[index];
 
 		if (proc == MyProc)
@@ -3441,7 +3441,7 @@ GetConflictingVirtualXIDs(TransactionId limitXmin, Oid dbOid)
 	for (index = 0; index < arrayP->numProcs; index++)
 	{
 		int			pgprocno = arrayP->pgprocnos[index];
-		PGPROC	   *proc = &allProcs[pgprocno];
+		PGPROC	   *proc = allProcs[pgprocno];
 
 		/* Exclude prepared transactions */
 		if (proc->pid == 0)
@@ -3506,7 +3506,7 @@ SignalVirtualTransaction(VirtualTransactionId vxid, ProcSignalReason sigmode,
 	for (index = 0; index < arrayP->numProcs; index++)
 	{
 		int			pgprocno = arrayP->pgprocnos[index];
-		PGPROC	   *proc = &allProcs[pgprocno];
+		PGPROC	   *proc = allProcs[pgprocno];
 		VirtualTransactionId procvxid;
 
 		GET_VXID_FROM_PGPROC(procvxid, *proc);
@@ -3561,7 +3561,7 @@ MinimumActiveBackends(int min)
 	for (index = 0; index < arrayP->numProcs; index++)
 	{
 		int			pgprocno = arrayP->pgprocnos[index];
-		PGPROC	   *proc = &allProcs[pgprocno];
+		PGPROC	   *proc = allProcs[pgprocno];
 
 		/*
 		 * Since we're not holding a lock, need to be prepared to deal with
@@ -3607,7 +3607,7 @@ CountDBBackends(Oid databaseid)
 	for (index = 0; index < arrayP->numProcs; index++)
 	{
 		int			pgprocno = arrayP->pgprocnos[index];
-		PGPROC	   *proc = &allProcs[pgprocno];
+		PGPROC	   *proc = allProcs[pgprocno];
 
 		if (proc->pid == 0)
 			continue;			/* do not count prepared xacts */
@@ -3636,7 +3636,7 @@ CountDBConnections(Oid databaseid)
 	for (index = 0; index < arrayP->numProcs; index++)
 	{
 		int			pgprocno = arrayP->pgprocnos[index];
-		PGPROC	   *proc = &allProcs[pgprocno];
+		PGPROC	   *proc = allProcs[pgprocno];
 
 		if (proc->pid == 0)
 			continue;			/* do not count prepared xacts */
@@ -3667,7 +3667,7 @@ CancelDBBackends(Oid databaseid, ProcSignalReason sigmode, bool conflictPending)
 	for (index = 0; index < arrayP->numProcs; index++)
 	{
 		int			pgprocno = arrayP->pgprocnos[index];
-		PGPROC	   *proc = &allProcs[pgprocno];
+		PGPROC	   *proc = allProcs[pgprocno];
 
 		if (databaseid == InvalidOid || proc->databaseId == databaseid)
 		{
@@ -3708,7 +3708,7 @@ CountUserBackends(Oid roleid)
 	for (index = 0; index < arrayP->numProcs; index++)
 	{
 		int			pgprocno = arrayP->pgprocnos[index];
-		PGPROC	   *proc = &allProcs[pgprocno];
+		PGPROC	   *proc = allProcs[pgprocno];
 
 		if (proc->pid == 0)
 			continue;			/* do not count prepared xacts */
@@ -3771,7 +3771,7 @@ CountOtherDBBackends(Oid databaseId, int *nbackends, int *nprepared)
 		for (index = 0; index < arrayP->numProcs; index++)
 		{
 			int			pgprocno = arrayP->pgprocnos[index];
-			PGPROC	   *proc = &allProcs[pgprocno];
+			PGPROC	   *proc = allProcs[pgprocno];
 			uint8		statusFlags = ProcGlobal->statusFlags[index];
 
 			if (proc->databaseId != databaseId)
@@ -3837,7 +3837,7 @@ TerminateOtherDBBackends(Oid databaseId)
 	for (i = 0; i < procArray->numProcs; i++)
 	{
 		int			pgprocno = arrayP->pgprocnos[i];
-		PGPROC	   *proc = &allProcs[pgprocno];
+		PGPROC	   *proc = allProcs[pgprocno];
 
 		if (proc->databaseId != databaseId)
 			continue;
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 2776ceb295b..95b1da42408 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -2844,7 +2844,7 @@ FastPathTransferRelationLocks(LockMethod lockMethodTable, const LOCKTAG *locktag
 	 */
 	for (i = 0; i < ProcGlobal->allProcCount; i++)
 	{
-		PGPROC	   *proc = &ProcGlobal->allProcs[i];
+		PGPROC	   *proc = ProcGlobal->allProcs[i];
 		uint32		j;
 
 		LWLockAcquire(&proc->fpInfoLock, LW_EXCLUSIVE);
@@ -3103,7 +3103,7 @@ GetLockConflicts(const LOCKTAG *locktag, LOCKMODE lockmode, int *countp)
 		 */
 		for (i = 0; i < ProcGlobal->allProcCount; i++)
 		{
-			PGPROC	   *proc = &ProcGlobal->allProcs[i];
+			PGPROC	   *proc = ProcGlobal->allProcs[i];
 			uint32		j;
 
 			/* A backend never blocks itself */
@@ -3790,7 +3790,7 @@ GetLockStatusData(void)
 	 */
 	for (i = 0; i < ProcGlobal->allProcCount; ++i)
 	{
-		PGPROC	   *proc = &ProcGlobal->allProcs[i];
+		PGPROC	   *proc = ProcGlobal->allProcs[i];
 
 		/* Skip backends with pid=0, as they don't hold fast-path locks */
 		if (proc->pid == 0)
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index e9ef0fbfe32..9d3e94a7b3a 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -29,21 +29,29 @@
  */
 #include "postgres.h"
 
+#include <sched.h>
 #include <signal.h>
 #include <unistd.h>
 #include <sys/time.h>
 
+#ifdef USE_LIBNUMA
+#include <numa.h>
+#include <numaif.h>
+#endif
+
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
+#include "port/pg_numa.h"
 #include "postmaster/autovacuum.h"
 #include "replication/slotsync.h"
 #include "replication/syncrep.h"
 #include "storage/condition_variable.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
+#include "storage/pg_shmem.h"
 #include "storage/pmsignal.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
@@ -89,6 +97,12 @@ static void ProcKill(int code, Datum arg);
 static void AuxiliaryProcKill(int code, Datum arg);
 static void CheckDeadLock(void);
 
+/* NUMA */
+static Size get_memory_page_size(void); /* XXX duplicate */
+static void move_to_node(char *startptr, char *endptr,
+						 Size mem_page_size, int node);
+static int	numa_nodes = -1;
+
 
 /*
  * Report shared-memory space needed by PGPROC.
@@ -100,11 +114,40 @@ PGProcShmemSize(void)
 	Size		TotalProcs =
 		add_size(MaxBackends, add_size(NUM_AUXILIARY_PROCS, max_prepared_xacts));
 
+	size = add_size(size, mul_size(TotalProcs, sizeof(PGPROC *)));
 	size = add_size(size, mul_size(TotalProcs, sizeof(PGPROC)));
 	size = add_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->xids)));
 	size = add_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->subxidStates)));
 	size = add_size(size, mul_size(TotalProcs, sizeof(*ProcGlobal->statusFlags)));
 
+	/*
+	 * With NUMA, we allocate the PGPROC array in several chunks. With shared
+	 * buffers we simply manually assign parts of the buffer array to
+	 * different NUMA nodes, and that does the trick. But we can't do that for
+	 * PGPROC, as the number of PGPROC entries is much lower, especially with
+	 * huge pages. We can fit ~2k entries on a 2MB page, and NUMA does stuff
+	 * with page granularity, and the large NUMA systems are likely to use
+	 * huge pages. So with sensible max_connections we would not use more than
+	 * a single page, which means it gets to a single NUMA node.
+	 *
+	 * So we allocate PGPROC not as a single array, but one array per NUMA
+	 * node, and then one array for aux processes (without NUMA node
+	 * assigned). Each array may need up to memory-page-worth of padding,
+	 * worst case. So we just add that - it's a bit wasteful, but good enough
+	 * for PoC.
+	 *
+	 * FIXME Should be conditional, but that was causing problems in bootstrap
+	 * mode. Or maybe it was because the code that allocates stuff later does
+	 * not do that conditionally. Anyway, needs to be fixed.
+	 */
+	/* if (numa_procs_interleave) */
+	{
+		int			num_nodes = numa_num_configured_nodes();
+		Size		mem_page_size = get_memory_page_size();
+
+		size = add_size(size, mul_size((num_nodes + 1), mem_page_size));
+	}
+
 	return size;
 }
 
@@ -129,6 +172,26 @@ FastPathLockShmemSize(void)
 
 	size = add_size(size, mul_size(TotalProcs, (fpLockBitsSize + fpRelIdSize)));
 
+	/*
+	 * Same NUMA-padding logic as in PGProcShmemSize, adding a memory page per
+	 * NUMA node - but this way we add two pages per node - one for PGPROC,
+	 * one for fast-path arrays. In theory we could make this work just one
+	 * page per node, by adding fast-path arrays right after PGPROC entries on
+	 * each node. But now we allocate fast-path locks separately - good enough
+	 * for PoC.
+	 *
+	 * FIXME Should be conditional, but that was causing problems in bootstrap
+	 * mode. Or maybe it was because the code that allocates stuff later does
+	 * not do that conditionally. Anyway, needs to be fixed.
+	 */
+	/* if (numa_procs_interleave) */
+	{
+		int			num_nodes = numa_num_configured_nodes();
+		Size		mem_page_size = get_memory_page_size();
+
+		size = add_size(size, mul_size((num_nodes + 1), mem_page_size));
+	}
+
 	return size;
 }
 
@@ -191,11 +254,13 @@ ProcGlobalSemas(void)
 void
 InitProcGlobal(void)
 {
-	PGPROC	   *procs;
+	PGPROC	  **procs;
 	int			i,
 				j;
 	bool		found;
 	uint32		TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + max_prepared_xacts;
+	int			procs_total;
+	int			procs_per_node;
 
 	/* Used for setup of per-backend fast-path slots. */
 	char	   *fpPtr,
@@ -205,6 +270,8 @@ InitProcGlobal(void)
 	Size		requestSize;
 	char	   *ptr;
 
+	Size		mem_page_size = get_memory_page_size();
+
 	/* Create the ProcGlobal shared structure */
 	ProcGlobal = (PROC_HDR *)
 		ShmemInitStruct("Proc Header", sizeof(PROC_HDR), &found);
@@ -224,6 +291,9 @@ InitProcGlobal(void)
 	pg_atomic_init_u32(&ProcGlobal->procArrayGroupFirst, INVALID_PROC_NUMBER);
 	pg_atomic_init_u32(&ProcGlobal->clogGroupFirst, INVALID_PROC_NUMBER);
 
+	/* one chunk per NUMA node (without NUMA assume 1 node) */
+	numa_nodes = numa_num_configured_nodes();
+
 	/*
 	 * Create and initialize all the PGPROC structures we'll need.  There are
 	 * six separate consumers: (1) normal backends, (2) autovacuum workers and
@@ -241,19 +311,108 @@ InitProcGlobal(void)
 
 	MemSet(ptr, 0, requestSize);
 
-	procs = (PGPROC *) ptr;
-	ptr = (char *) ptr + TotalProcs * sizeof(PGPROC);
+	/* allprocs (array of pointers to PGPROC entries) */
+	procs = (PGPROC **) ptr;
+	ptr = (char *) ptr + TotalProcs * sizeof(PGPROC *);
 
 	ProcGlobal->allProcs = procs;
 	/* XXX allProcCount isn't really all of them; it excludes prepared xacts */
 	ProcGlobal->allProcCount = MaxBackends + NUM_AUXILIARY_PROCS;
 
+	/*
+	 * NUMA partitioning
+	 *
+	 * Now build the actual PGPROC arrays, one "chunk" per NUMA node (and one
+	 * extra for auxiliary processes and 2PC transactions, not associated with
+	 * any particular node).
+	 *
+	 * First determine how many "backend" procs to allocate per NUMA node. The
+	 * count may not be exactly divisible, but we mostly ignore that. The last
+	 * node may get somewhat fewer PGPROC entries, but the imbalance ought to
+	 * be pretty small (if MaxBackends >> numa_nodes).
+	 *
+	 * XXX A fairer distribution is possible, but not worth it now.
+	 */
+	procs_per_node = (MaxBackends + (numa_nodes - 1)) / numa_nodes;
+	procs_total = 0;
+
+	/* build PGPROC entries for NUMA nodes */
+	for (i = 0; i < numa_nodes; i++)
+	{
+		PGPROC	   *procs_node;
+
+		/* the last NUMA node may get fewer PGPROC entries, but meh */
+		int			count_node = Min(procs_per_node, MaxBackends - procs_total);
+
+		/* make sure to align the PGPROC array to memory page */
+		ptr = (char *) TYPEALIGN(mem_page_size, ptr);
+
+		/* allocate the PGPROC chunk for this node */
+		procs_node = (PGPROC *) ptr;
+		ptr = (char *) ptr + count_node * sizeof(PGPROC);
+
+		/* don't overflow the allocation */
+		Assert((ptr > (char *) procs) && (ptr <= (char *) procs + requestSize));
+
+		/* add pointers to the PGPROC entries to allProcs */
+		for (j = 0; j < count_node; j++)
+		{
+			procs_node[j].numa_node = i;
+			procs_node[j].procnumber = procs_total;
+
+			ProcGlobal->allProcs[procs_total++] = &procs_node[j];
+		}
+
+		move_to_node((char *) procs_node, ptr, mem_page_size, i);
+	}
+
+	/*
+	 * also build PGPROC entries for auxiliary procs / prepared xacts (we
+	 * don't assign those to any NUMA node)
+	 *
+	 * XXX Mostly duplicate of preceding block, could be reused.
+	 */
+	{
+		PGPROC	   *procs_node;
+		int			count_node = (NUM_AUXILIARY_PROCS + max_prepared_xacts);
+
+		/*
+		 * Make sure to align PGPROC array to memory page (it may not be
+		 * aligned). We won't assign this to any NUMA node, but we still don't
+		 * want it to interfere with the preceding chunk (for the last NUMA
+		 * node).
+		 */
+		ptr = (char *) TYPEALIGN(mem_page_size, ptr);
+
+		procs_node = (PGPROC *) ptr;
+		ptr = (char *) ptr + count_node * sizeof(PGPROC);
+
+		/* don't overflow the allocation */
+		Assert((ptr > (char *) procs) && (ptr <= (char *) procs + requestSize));
+
+		/* now add the PGPROC pointers to allProcs */
+		for (j = 0; j < count_node; j++)
+		{
+			procs_node[j].numa_node = -1;
+			procs_node[j].procnumber = procs_total;
+
+			ProcGlobal->allProcs[procs_total++] = &procs_node[j];
+		}
+	}
+
+	/* we should have allocated the expected number of PGPROC entries */
+	Assert(procs_total == TotalProcs);
+
 	/*
 	 * Allocate arrays mirroring PGPROC fields in a dense manner. See
 	 * PROC_HDR.
 	 *
 	 * XXX: It might make sense to increase padding for these arrays, given
 	 * how hotly they are accessed.
+	 *
+	 * XXX Would it make sense to NUMA-partition these chunks too, somehow?
+	 * But those arrays are tiny, fit into a single memory page, so would need
+	 * to be made more complex. Not sure.
 	 */
 	ProcGlobal->xids = (TransactionId *) ptr;
 	ptr = (char *) ptr + (TotalProcs * sizeof(*ProcGlobal->xids));
@@ -286,23 +445,100 @@ InitProcGlobal(void)
 	/* For asserts checking we did not overflow. */
 	fpEndPtr = fpPtr + requestSize;
 
-	for (i = 0; i < TotalProcs; i++)
+	/* reset the count */
+	procs_total = 0;
+
+	/*
+	 * Mimic the same logic as above, but for fast-path locking.
+	 */
+	for (i = 0; i < numa_nodes; i++)
 	{
-		PGPROC	   *proc = &procs[i];
+		char	   *startptr;
+		char	   *endptr;
 
-		/* Common initialization for all PGPROCs, regardless of type. */
+		/* the last NUMA node may get fewer PGPROC entries, but meh */
+		int			procs_node = Min(procs_per_node, MaxBackends - procs_total);
+
+		/* align to memory page, to make move_pages possible */
+		fpPtr = (char *) TYPEALIGN(mem_page_size, fpPtr);
+
+		startptr = fpPtr;
+		endptr = fpPtr + procs_node * (fpLockBitsSize + fpRelIdSize);
+
+		move_to_node(startptr, endptr, mem_page_size, i);
 
 		/*
-		 * Set the fast-path lock arrays, and move the pointer. We interleave
-		 * the two arrays, to (hopefully) get some locality for each backend.
+		 * Now point the PGPROC entries to the fast-path arrays, and also
+		 * advance the fpPtr.
 		 */
-		proc->fpLockBits = (uint64 *) fpPtr;
-		fpPtr += fpLockBitsSize;
+		for (j = 0; j < procs_node; j++)
+		{
+			PGPROC	   *proc = ProcGlobal->allProcs[procs_total++];
+
+			/* cross-check we got the expected NUMA node */
+			Assert(proc->numa_node == i);
+			Assert(proc->procnumber == (procs_total - 1));
+
+			/*
+			 * Set the fast-path lock arrays, and move the pointer. We
+			 * interleave the two arrays, to (hopefully) get some locality for
+			 * each backend.
+			 */
+			proc->fpLockBits = (uint64 *) fpPtr;
+			fpPtr += fpLockBitsSize;
 
-		proc->fpRelId = (Oid *) fpPtr;
-		fpPtr += fpRelIdSize;
+			proc->fpRelId = (Oid *) fpPtr;
+			fpPtr += fpRelIdSize;
 
-		Assert(fpPtr <= fpEndPtr);
+			Assert(fpPtr <= fpEndPtr);
+		}
+
+		Assert(fpPtr == endptr);
+	}
+
+	/* auxiliary processes / prepared xacts */
+	{
+		/* the last NUMA node may get fewer PGPROC entries, but meh */
+		int			procs_node = (NUM_AUXILIARY_PROCS + max_prepared_xacts);
+
+		/* align to memory page, to make move_pages possible */
+		fpPtr = (char *) TYPEALIGN(mem_page_size, fpPtr);
+
+		/* now point the PGPROC entries to the fast-path arrays */
+		for (j = 0; j < procs_node; j++)
+		{
+			PGPROC	   *proc = ProcGlobal->allProcs[procs_total++];
+
+			/* cross-check we got PGPROC with no NUMA node assigned */
+			Assert(proc->numa_node == -1);
+			Assert(proc->procnumber == (procs_total - 1));
+
+			/*
+			 * Set the fast-path lock arrays, and move the pointer. We
+			 * interleave the two arrays, to (hopefully) get some locality for
+			 * each backend.
+			 */
+			proc->fpLockBits = (uint64 *) fpPtr;
+			fpPtr += fpLockBitsSize;
+
+			proc->fpRelId = (Oid *) fpPtr;
+			fpPtr += fpRelIdSize;
+
+			Assert(fpPtr <= fpEndPtr);
+		}
+	}
+
+	/* Should have consumed exactly the expected amount of fast-path memory. */
+	Assert(fpPtr <= fpEndPtr);
+
+	/* make sure we allocated the expected number of PGPROC entries */
+	Assert(procs_total == TotalProcs);
+
+	for (i = 0; i < TotalProcs; i++)
+	{
+		PGPROC	   *proc = procs[i];
+
+		Assert(proc->procnumber == i);
 
 		/*
 		 * Set up per-PGPROC semaphore, latch, and fpInfoLock.  Prepared xact
@@ -366,15 +602,12 @@ InitProcGlobal(void)
 		pg_atomic_init_u64(&(proc->waitStart), 0);
 	}
 
-	/* Should have consumed exactly the expected amount of fast-path memory. */
-	Assert(fpPtr == fpEndPtr);
-
 	/*
 	 * Save pointers to the blocks of PGPROC structures reserved for auxiliary
 	 * processes and prepared transactions.
 	 */
-	AuxiliaryProcs = &procs[MaxBackends];
-	PreparedXactProcs = &procs[MaxBackends + NUM_AUXILIARY_PROCS];
+	AuxiliaryProcs = procs[MaxBackends];
+	PreparedXactProcs = procs[MaxBackends + NUM_AUXILIARY_PROCS];
 
 	/* Create ProcStructLock spinlock, too */
 	ProcStructLock = (slock_t *) ShmemInitStruct("ProcStructLock spinlock",
@@ -435,7 +668,45 @@ InitProcess(void)
 
 	if (!dlist_is_empty(procgloballist))
 	{
-		MyProc = dlist_container(PGPROC, links, dlist_pop_head_node(procgloballist));
+		/*
+		 * With numa interleaving of PGPROC, try to get a PROC entry from the
+		 * right NUMA node (when the process starts).
+		 *
+		 * XXX The process may move to a different NUMA node later, but
+		 * there's not much we can do about that.
+		 */
+		if (numa_procs_interleave)
+		{
+			dlist_mutable_iter iter;
+			unsigned	cpu;
+			unsigned	node;
+			int			rc;
+
+			rc = getcpu(&cpu, &node);
+			if (rc != 0)
+				elog(ERROR, "getcpu failed: %m");
+
+			MyProc = NULL;
+
+			dlist_foreach_modify(iter, procgloballist)
+			{
+				PGPROC	   *proc;
+
+				proc = dlist_container(PGPROC, links, iter.cur);
+
+				if (proc->numa_node == node)
+				{
+					MyProc = proc;
+					dlist_delete(iter.cur);
+					break;
+				}
+			}
+		}
+
+		/* didn't find PGPROC from the correct NUMA node, pick any free one */
+		if (MyProc == NULL)
+			MyProc = dlist_container(PGPROC, links, dlist_pop_head_node(procgloballist));
+
 		SpinLockRelease(ProcStructLock);
 	}
 	else
@@ -1988,7 +2259,7 @@ ProcSendSignal(ProcNumber procNumber)
 	if (procNumber < 0 || procNumber >= ProcGlobal->allProcCount)
 		elog(ERROR, "procNumber out of range");
 
-	SetLatch(&ProcGlobal->allProcs[procNumber].procLatch);
+	SetLatch(&ProcGlobal->allProcs[procNumber]->procLatch);
 }
 
 /*
@@ -2063,3 +2334,60 @@ BecomeLockGroupMember(PGPROC *leader, int pid)
 
 	return ok;
 }
+
+/* copy from buf_init.c */
+static Size
+get_memory_page_size(void)
+{
+	Size		os_page_size;
+	Size		huge_page_size;
+
+#ifdef WIN32
+	SYSTEM_INFO sysinfo;
+
+	GetSystemInfo(&sysinfo);
+	os_page_size = sysinfo.dwPageSize;
+#else
+	os_page_size = sysconf(_SC_PAGESIZE);
+#endif
+
+	/*
+	 * XXX This is a bit annoying/confusing, because we may get a different
+	 * result depending on when we call it. Before mmap() we don't know if the
+	 * huge pages get used, so we assume they will. And then if we don't get
+	 * huge pages, we'll waste memory etc.
+	 */
+
+	/* assume huge pages get used, unless HUGE_PAGES_OFF */
+	if (huge_pages_status == HUGE_PAGES_OFF)
+		huge_page_size = 0;
+	else
+		GetHugePageSize(&huge_page_size, NULL);
+
+	return Max(os_page_size, huge_page_size);
+}
+
+/*
+ * move_to_node
+ *		move all pages in the given range to the requested NUMA node
+ *
+ * XXX This is expected to only process fairly small number of pages, so no
+ * need to do batching etc. Just move pages one by one.
+ */
+static void
+move_to_node(char *startptr, char *endptr, Size mem_page_size, int node)
+{
+	while (startptr < endptr)
+	{
+		int			r,
+					status;
+
+		r = numa_move_pages(0, 1, (void **) &startptr, &node, &status, 0);
+
+		if (r != 0)
+			elog(WARNING, "failed to move page to NUMA node %d (r = %d, status = %d)",
+				 node, r, status);
+
+		startptr += mem_page_size;
+	}
+}
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 7febf3001a3..bf775c76545 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -149,6 +149,7 @@ int			MaxBackends = 0;
 bool		numa_buffers_interleave = false;
 bool		numa_localalloc = false;
 int			numa_partition_freelist = 0;
+bool		numa_procs_interleave = false;
 
 /* GUC parameters for vacuum */
 int			VacuumBufferUsageLimit = 2048;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index e2361c161e6..930082588f2 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2144,6 +2144,16 @@ struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"numa_procs_interleave", PGC_POSTMASTER, DEVELOPER_OPTIONS,
+			gettext_noop("Enables NUMA interleaving of PGPROC entries."),
+			gettext_noop("When enabled, the PGPROC entries are interleaved to all NUMA nodes."),
+		},
+		&numa_procs_interleave,
+		false,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"sync_replication_slots", PGC_SIGHUP, REPLICATION_STANDBY,
 			gettext_noop("Enables a physical standby to synchronize logical failover replication slots from the primary server."),
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 17528439f07..f454b4e9d75 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -181,6 +181,7 @@ extern PGDLLIMPORT int max_parallel_workers;
 extern PGDLLIMPORT bool numa_buffers_interleave;
 extern PGDLLIMPORT bool numa_localalloc;
 extern PGDLLIMPORT int numa_partition_freelist;
+extern PGDLLIMPORT bool numa_procs_interleave;
 
 extern PGDLLIMPORT int commit_timestamp_buffers;
 extern PGDLLIMPORT int multixact_member_buffers;
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 9f9b3fcfbf1..5cb1632718e 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -194,6 +194,8 @@ struct PGPROC
 								 * vacuum must not remove tuples deleted by
 								 * xid >= xmin ! */
 
+	int			procnumber;		/* index in ProcGlobal->allProcs */
+
 	int			pid;			/* Backend's process ID; 0 if prepared xact */
 
 	int			pgxactoff;		/* offset into various ProcGlobal->arrays with
@@ -319,6 +321,9 @@ struct PGPROC
 	PGPROC	   *lockGroupLeader;	/* lock group leader, if I'm a member */
 	dlist_head	lockGroupMembers;	/* list of members, if I'm a leader */
 	dlist_node	lockGroupLink;	/* my member link, if I'm a member */
+
+	/* NUMA node */
+	int			numa_node;
 };
 
 /* NOTE: "typedef struct PGPROC PGPROC" appears in storage/lock.h. */
@@ -383,7 +388,7 @@ extern PGDLLIMPORT PGPROC *MyProc;
 typedef struct PROC_HDR
 {
 	/* Array of PGPROC structures (not including dummies for prepared txns) */
-	PGPROC	   *allProcs;
+	PGPROC	  **allProcs;
 
 	/* Array mirroring PGPROC.xid for each PGPROC currently in the procarray */
 	TransactionId *xids;
@@ -435,8 +440,8 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
 /*
  * Accessors for getting PGPROC given a ProcNumber and vice versa.
  */
-#define GetPGProcByNumber(n) (&ProcGlobal->allProcs[(n)])
-#define GetNumberFromPGProc(proc) ((proc) - &ProcGlobal->allProcs[0])
+#define GetPGProcByNumber(n) (ProcGlobal->allProcs[(n)])
+#define GetNumberFromPGProc(proc) ((proc)->procnumber)
 
 /*
  * We set aside some extra PGPROC structures for "special worker" processes,
-- 
2.49.0

From f76377a56f37421c61c4dd876813b57084b019df Mon Sep 17 00:00:00 2001
From: Tomas Vondra <to...@vondra.me>
Date: Tue, 27 May 2025 23:08:48 +0200
Subject: [PATCH v1 6/6] NUMA: pin backends to NUMA nodes

When initializing the backend, we pick a PGPROC entry from the right
NUMA node where the backend is running. But the process can move to a
different core / node, so to prevent that we pin it.
---
 src/backend/storage/lmgr/proc.c     | 21 +++++++++++++++++++++
 src/backend/utils/init/globals.c    |  1 +
 src/backend/utils/misc/guc_tables.c | 10 ++++++++++
 src/include/miscadmin.h             |  1 +
 4 files changed, 33 insertions(+)

diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 9d3e94a7b3a..4c9e55608b2 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -729,6 +729,27 @@ InitProcess(void)
 	}
 	MyProcNumber = GetNumberFromPGProc(MyProc);
 
+	/*
+	 * Optionally, restrict the process to only run on CPUs from the same NUMA
+	 * as the PGPROC. We do this even if the PGPROC has a different NUMA node,
+	 * but not for PGPROC entries without a node (i.e. aux/2PC entries).
+	 *
+	 * This also means we only do this with numa_procs_interleave, because
+	 * without that we'll have numa_node=-1 for all PGPROC entries.
+	 *
+	 * FIXME add proper error-checking for libnuma functions
+	 */
+	if (numa_procs_pin && MyProc->numa_node != -1)
+	{
+		struct bitmask *cpumask = numa_allocate_cpumask();
+
+		numa_node_to_cpus(MyProc->numa_node, cpumask);
+
+		numa_sched_setaffinity(MyProcPid, cpumask);
+
+		numa_free_cpumask(cpumask);
+	}
+
 	/*
 	 * Cross-check that the PGPROC is of the type we expect; if this were not
 	 * the case, it would get returned to the wrong list.
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index bf775c76545..e584ba840ef 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -150,6 +150,7 @@ bool		numa_buffers_interleave = false;
 bool		numa_localalloc = false;
 int			numa_partition_freelist = 0;
 bool		numa_procs_interleave = false;
+bool		numa_procs_pin = false;
 
 /* GUC parameters for vacuum */
 int			VacuumBufferUsageLimit = 2048;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 930082588f2..3fc8897ae36 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2154,6 +2154,16 @@ struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"numa_procs_pin", PGC_POSTMASTER, DEVELOPER_OPTIONS,
+			gettext_noop("Enables pinning backends to NUMA nodes (matching the PGPROC node)."),
+			gettext_noop("When enabled, sets affinity to CPUs from the same NUMA node."),
+		},
+		&numa_procs_pin,
+		false,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"sync_replication_slots", PGC_SIGHUP, REPLICATION_STANDBY,
 			gettext_noop("Enables a physical standby to synchronize logical failover replication slots from the primary server."),
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index f454b4e9d75..d0d960caa9d 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -182,6 +182,7 @@ extern PGDLLIMPORT bool numa_buffers_interleave;
 extern PGDLLIMPORT bool numa_localalloc;
 extern PGDLLIMPORT int numa_partition_freelist;
 extern PGDLLIMPORT bool numa_procs_interleave;
+extern PGDLLIMPORT bool numa_procs_pin;
 
 extern PGDLLIMPORT int commit_timestamp_buffers;
 extern PGDLLIMPORT int multixact_member_buffers;
-- 
2.49.0

Adding basic NUMA awareness

Reply via email to