On 10/29/25 01:05, Tomas Vondra wrote:
> ...
> Yeah, I definitely want to protect against this. I believe similar
> failures can happen even with much lower m_w_m values (possibly ~2-3GB),
> although only with weird/skewed data sets. AFAICS a constant
> single-element array would trigger this, but I haven't tested that.
> 
> Serial builds can fail with large maintenance_work_mem too, like this:
> 
>   ERROR: posting list is too long
>   HINT: Reduce "maintenance_work_mem".
> 
> but it's deterministic, and it's actually a proper error message, not
> just some weird "invalid alloc size".
> 
> Attached is a v3 of the patch series. 0001 and 0002 were already posted,
> and I believe either of those would address the issue. 0003 is more of
> an optimization, further reducing the memory usage.
> 
> I'm putting this through additional testing, which takes time. But it
> seems there's still some loose end in 0001, as I just got the "invalid
> alloc request" failure with it applied ... I'll take a look tomorrow.
> 

Unsurprisingly, there were a couple more palloc/repalloc calls (in
ginPostingListDecodeAllSegments) that could fail with long TID lists
produced when merging worker data. The attached v4 fixes this.
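
FWIW, a back-of-the-envelope check of the failing request size: ItemPointerData
is 6 bytes, so

    1113001620 / 6 = 185,500,270 TIDs

while a regular palloc is capped at MaxAllocSize (1GB - 1), i.e. roughly 178.9M
TIDs. Any merged TID list longer than that trips the limit, regardless of which
palloc/repalloc call happens to do the allocation.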

However, I see this as a sign that allowing huge allocations is not the
right way to fix this. The GIN code generally assumes allocations stay
below MaxAllocSize, and reworking that assumption in a bugfix seems a bit
too invasive. And I'm not really certain this is the last place that could
hit this.

Another argument against 0001 is that using more memory does not really help
anything. It's not any faster or simpler. It's more like "let's use the
memory we have" rather than "let's use the memory we need".

So I'm planning to get rid of 0001, and fix the issue with 0002 or 0002+0003.
That seems like a better and (unexpectedly) less invasive fix.
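
For illustration, with 0002 the per-chunk limit works out like this (6-byte
ItemPointerData and, say, a hypothetical build with 8 participants):

    maxlen = MaxAllocSize / sizeof(ItemPointerData) / (2 * participants)
           = 1073741823 / 6 / 16
          ~= 11.2M TIDs per chunk

so even with the leader holding a couple of chunks per participant in memory
at once, the total stays below MaxAllocSize.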


regards

-- 
Tomas Vondra
From 27a68b43636233952bff8aefce08112194b39d40 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <[email protected]>
Date: Sun, 26 Oct 2025 21:14:44 +0100
Subject: [PATCH v4 1/3] Allow parallel GIN builds to allocate large chunks

The parallel GIN builds used palloc/repalloc to maintain TID lists,
which with high maintenance_work_mem values can lead to failures like

    ERROR:  invalid memory alloc request size 1113001620

The reason is that while merging intermediate worker data, we call
GinBufferStoreTuple() which coalesces the TID lists, and the result
may not fit into MaxAllocSize.

Fixed by allowing huge allocations when merging TID lists, including
an existing palloc call in ginMergeItemPointers().

Report by Greg Smith, investigation and fix by me. Backpatched to 18,
where parallel GIN builds were introduced.

Reported-by: Gregory Smith <[email protected]>
Discussion: https://postgr.es/m/cahljucwdwn-pe2bmze4kux7x5wwt_6rowta0muqffedlez6...@mail.gmail.com
Backpatch-through: 18
---
 src/backend/access/gin/gininsert.c      |  7 ++++---
 src/backend/access/gin/ginpostinglist.c | 10 ++++++----
 2 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 3d71b442aa9..2355b96b351 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -1496,10 +1496,11 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 		 * still pass 0 as number of elements in that array though.
 		 */
 		if (buffer->items == NULL)
-			buffer->items = palloc((buffer->nitems + tup->nitems) * sizeof(ItemPointerData));
+			buffer->items = palloc_extended((buffer->nitems + tup->nitems) * sizeof(ItemPointerData),
+											MCXT_ALLOC_HUGE);
 		else
-			buffer->items = repalloc(buffer->items,
-									 (buffer->nitems + tup->nitems) * sizeof(ItemPointerData));
+			buffer->items = repalloc_huge(buffer->items,
+										  (buffer->nitems + tup->nitems) * sizeof(ItemPointerData));
 
 		new = ginMergeItemPointers(&buffer->items[buffer->nfrozen], /* first unfrozen */
 								   (buffer->nitems - buffer->nfrozen),	/* num of unfrozen */
diff --git a/src/backend/access/gin/ginpostinglist.c b/src/backend/access/gin/ginpostinglist.c
index 48eadec87b0..a60ea46204a 100644
--- a/src/backend/access/gin/ginpostinglist.c
+++ b/src/backend/access/gin/ginpostinglist.c
@@ -308,7 +308,8 @@ ginPostingListDecodeAllSegments(GinPostingList *segment, int len, int *ndecoded_
 	 * Guess an initial size of the array.
 	 */
 	nallocated = segment->nbytes * 2 + 1;
-	result = palloc(nallocated * sizeof(ItemPointerData));
+	result = palloc_extended(nallocated * sizeof(ItemPointerData),
+							 MCXT_ALLOC_HUGE);
 
 	ndecoded = 0;
 	while ((char *) segment < endseg)
@@ -317,7 +318,7 @@ ginPostingListDecodeAllSegments(GinPostingList *segment, int len, int *ndecoded_
 		if (ndecoded >= nallocated)
 		{
 			nallocated *= 2;
-			result = repalloc(result, nallocated * sizeof(ItemPointerData));
+			result = repalloc_huge(result, nallocated * sizeof(ItemPointerData));
 		}
 
 		/* copy the first item */
@@ -335,7 +336,7 @@ ginPostingListDecodeAllSegments(GinPostingList *segment, int len, int *ndecoded_
 			if (ndecoded >= nallocated)
 			{
 				nallocated *= 2;
-				result = repalloc(result, nallocated * sizeof(ItemPointerData));
+				result = repalloc_huge(result, nallocated * sizeof(ItemPointerData));
 			}
 
 			val += decode_varbyte(&ptr);
@@ -381,7 +382,8 @@ ginMergeItemPointers(ItemPointerData *a, uint32 na,
 {
 	ItemPointerData *dst;
 
-	dst = (ItemPointer) palloc((na + nb) * sizeof(ItemPointerData));
+	dst = (ItemPointer) palloc_extended((na + nb) * sizeof(ItemPointerData),
+										MCXT_ALLOC_HUGE);
 
 	/*
 	 * If the argument arrays don't overlap, we can just append them to each
-- 
2.51.0

From 21c64c4e036eea0e2bfa94bc8517de5b2bbed3b3 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <[email protected]>
Date: Sun, 26 Oct 2025 21:23:37 +0100
Subject: [PATCH v4 2/3] Split TID lists during parallel GIN build

When building intermediate TID lists during parallel GIN builds, split
the sorted lists into smaller chunks, to limit the amount of memory
needed when merging the chunks later.

The leader may need to keep in memory up to one chunk per worker, and
possibly one extra chunk (before evicting some of the data). We limit
the chunk size so that memory usage does not exceed MaxAllocSize (1GB).

This is desirable even with huge allocations allowed. Larger chunks do
not improve performance, so the increased memory usage would simply be
wasted.

Report by Greg Smith, investigation and fix by me. Backpatched to 18,
where parallel GIN builds were introduced.

Reported-by: Gregory Smith <[email protected]>
Discussion: https://postgr.es/m/cahljucwdwn-pe2bmze4kux7x5wwt_6rowta0muqffedlez6...@mail.gmail.com
Backpatch-through: 18
---
 src/backend/access/gin/gininsert.c | 48 +++++++++++++++++++++++-------
 1 file changed, 37 insertions(+), 11 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 2355b96b351..d15e9a0cb0b 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -152,7 +152,9 @@ typedef struct
 	 * only in the leader process.
 	 */
 	GinLeader  *bs_leader;
-	int			bs_worker_id;
+
+	/* number of participating workers (including leader) */
+	int			bs_num_workers;
 
 	/* used to pass information from workers to leader */
 	double		bs_numtuples;
@@ -483,6 +485,11 @@ ginBuildCallback(Relation index, ItemPointer tid, Datum *values,
 /*
  * ginFlushBuildState
  *		Write all data from BuildAccumulator into the tuplesort.
+ *
+ * The number of TIDs written to the tuplesort at once is limited, to reduce
+ * the amount of memory needed when merging the intermediate results later.
+ * The leader will see up to two chunks per worker, so calculate the limit to
+ * not need more than MaxAllocSize overall.
  */
 static void
 ginFlushBuildState(GinBuildState *buildstate, Relation index)
@@ -493,6 +500,11 @@ ginFlushBuildState(GinBuildState *buildstate, Relation index)
 	uint32		nlist;
 	OffsetNumber attnum;
 	TupleDesc	tdesc = RelationGetDescr(index);
+	uint32		maxlen;
+
+	/* maximum number of TIDs per chunk (two chunks per worker) */
+	maxlen = MaxAllocSize / sizeof(ItemPointerData);
+	maxlen /= (2 * buildstate->bs_num_workers);
 
 	ginBeginBAScan(&buildstate->accum);
 	while ((list = ginGetBAEntry(&buildstate->accum,
@@ -501,20 +513,31 @@ ginFlushBuildState(GinBuildState *buildstate, Relation index)
 		/* information about the key */
 		CompactAttribute *attr = TupleDescCompactAttr(tdesc, (attnum - 1));
 
-		/* GIN tuple and tuple length */
-		GinTuple   *tup;
-		Size		tuplen;
+		/* start of the chunk */
+		uint32		offset = 0;
 
-		/* there could be many entries, so be willing to abort here */
-		CHECK_FOR_INTERRUPTS();
+		/* split the entry into smaller chunks with up to maxlen items */
+		while (offset < nlist)
+		{
+			/* GIN tuple and tuple length */
+			GinTuple   *tup;
+			Size		tuplen;
+			uint32		len = Min(maxlen, nlist - offset);
 
-		tup = _gin_build_tuple(attnum, category,
-							   key, attr->attlen, attr->attbyval,
-							   list, nlist, &tuplen);
+			/* there could be many entries, so be willing to abort here */
+			CHECK_FOR_INTERRUPTS();
+
+			tup = _gin_build_tuple(attnum, category,
+								   key, attr->attlen, attr->attbyval,
+								   &list[offset], len,
+								   &tuplen);
+
+			offset += maxlen;
 
-		tuplesort_putgintuple(buildstate->bs_worker_sort, tup, tuplen);
+			tuplesort_putgintuple(buildstate->bs_worker_sort, tup, tuplen);
 
-		pfree(tup);
+			pfree(tup);
+		}
 	}
 
 	MemoryContextReset(buildstate->tmpCtx);
@@ -2018,6 +2041,9 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 	/* remember how much space is allowed for the accumulated entries */
 	state->work_mem = (sortmem / 2);
 
+	/* remember how many workers participate in the build */
+	state->bs_num_workers = ginshared->scantuplesortstates;
+
 	/* Begin "partial" tuplesort */
 	state->bs_sortstate = tuplesort_begin_index_gin(heap, index,
 													state->work_mem,
-- 
2.51.0

From d14cdb4bf70bc30e6e3757b70ffa23c7d202a443 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <[email protected]>
Date: Mon, 27 Oct 2025 23:58:10 +0100
Subject: [PATCH v4 3/3] Trim TIDs during parallel GIN builds more eagerly

The code only froze the beginning of TID lists when merging the lists,
which means we'd only trim the list when adding the next chunk. But we
can do better - we can update the number of frozen items earlier.

This is not expected to make a huge difference, but it can't hurt and
it's virtually free.

Discussion: https://postgr.es/m/cahljucwdwn-pe2bmze4kux7x5wwt_6rowta0muqffedlez6...@mail.gmail.com
---
 src/backend/access/gin/gininsert.c | 129 ++++++++++++++---------------
 1 file changed, 64 insertions(+), 65 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index d15e9a0cb0b..37d72811b9a 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -1401,6 +1401,48 @@ GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
 static bool
 GinBufferShouldTrim(GinBuffer *buffer, GinTuple *tup)
 {
+	/*
+	 * Try freeze TIDs at the beginning of the list, i.e. exclude them from
+	 * the mergesort. We can do that with TIDs before the first TID in the new
+	 * tuple we're about to add into the buffer.
+	 *
+	 * We do this incrementally when adding data into the in-memory buffer,
+	 * and not later (e.g. when hitting a memory limit), because it allows us
+	 * to skip the frozen data during the mergesort, making it cheaper.
+	 */
+
+	/*
+	 * Check if the last TID in the current list is frozen. This is the case
+	 * when merging non-overlapping lists, e.g. in each parallel worker.
+	 */
+	if ((buffer->nitems > 0) &&
+		(ItemPointerCompare(&buffer->items[buffer->nitems - 1],
+							GinTupleGetFirst(tup)) == 0))
+		buffer->nfrozen = buffer->nitems;
+
+	/*
+	 * Now find the last TID we know to be frozen, i.e. the last TID right
+	 * before the new GIN tuple.
+	 *
+	 * Start with the first not-yet-frozen tuple, and walk until we find the
+	 * first TID that's higher. If we already know the whole list is frozen
+	 * (i.e. nfrozen == nitems), this does nothing.
+	 *
+	 * XXX This might do a binary search for sufficiently long lists, but it
+	 * does not seem worth the complexity. Overlapping lists should not be
+	 * common, TID comparisons are cheap, and we should quickly freeze most of
+	 * the list.
+	 */
+	for (int i = buffer->nfrozen; i < buffer->nitems; i++)
+	{
+		/* Is the TID after the first TID of the new tuple? Can't freeze. */
+		if (ItemPointerCompare(&buffer->items[i],
+							   GinTupleGetFirst(tup)) > 0)
+			break;
+
+		buffer->nfrozen++;
+	}
+
 	/* not enough TIDs to trim (1024 is somewhat arbitrary number) */
 	if (buffer->nfrozen < 1024)
 		return false;
@@ -1445,6 +1487,9 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 	ItemPointerData *items;
 	Datum		key;
 
+	int			nnew;
+	ItemPointer new;
+
 	AssertCheckGinBuffer(buffer);
 
 	key = _gin_parse_tuple_key(tup);
@@ -1466,80 +1511,34 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 			buffer->key = (Datum) 0;
 	}
 
-	/*
-	 * Try freeze TIDs at the beginning of the list, i.e. exclude them from
-	 * the mergesort. We can do that with TIDs before the first TID in the new
-	 * tuple we're about to add into the buffer.
-	 *
-	 * We do this incrementally when adding data into the in-memory buffer,
-	 * and not later (e.g. when hitting a memory limit), because it allows us
-	 * to skip the frozen data during the mergesort, making it cheaper.
-	 */
-
-	/*
-	 * Check if the last TID in the current list is frozen. This is the case
-	 * when merging non-overlapping lists, e.g. in each parallel worker.
-	 */
-	if ((buffer->nitems > 0) &&
-		(ItemPointerCompare(&buffer->items[buffer->nitems - 1],
-							GinTupleGetFirst(tup)) == 0))
-		buffer->nfrozen = buffer->nitems;
+	/* add the new TIDs into the buffer, combine using merge-sort */
 
 	/*
-	 * Now find the last TID we know to be frozen, i.e. the last TID right
-	 * before the new GIN tuple.
-	 *
-	 * Start with the first not-yet-frozen tuple, and walk until we find the
-	 * first TID that's higher. If we already know the whole list is frozen
-	 * (i.e. nfrozen == nitems), this does nothing.
-	 *
-	 * XXX This might do a binary search for sufficiently long lists, but it
-	 * does not seem worth the complexity. Overlapping lists should be rare
-	 * common, TID comparisons are cheap, and we should quickly freeze most of
-	 * the list.
+	 * Resize the array - we do this first, because we'll dereference the
+	 * first unfrozen TID, which would fail if the array is NULL. We'll still
+	 * pass 0 as number of elements in that array though.
 	 */
-	for (int i = buffer->nfrozen; i < buffer->nitems; i++)
-	{
-		/* Is the TID after the first TID of the new tuple? Can't freeze. */
-		if (ItemPointerCompare(&buffer->items[i],
-							   GinTupleGetFirst(tup)) > 0)
-			break;
-
-		buffer->nfrozen++;
-	}
-
-	/* add the new TIDs into the buffer, combine using merge-sort */
-	{
-		int			nnew;
-		ItemPointer new;
-
-		/*
-		 * Resize the array - we do this first, because we'll dereference the
-		 * first unfrozen TID, which would fail if the array is NULL. We'll
-		 * still pass 0 as number of elements in that array though.
-		 */
-		if (buffer->items == NULL)
-			buffer->items = palloc_extended((buffer->nitems + tup->nitems) * sizeof(ItemPointerData),
-											MCXT_ALLOC_HUGE);
-		else
-			buffer->items = repalloc_huge(buffer->items,
-										  (buffer->nitems + tup->nitems) * sizeof(ItemPointerData));
+	if (buffer->items == NULL)
+		buffer->items = palloc_extended((buffer->nitems + tup->nitems) * sizeof(ItemPointerData),
+										MCXT_ALLOC_HUGE);
+	else
+		buffer->items = repalloc_huge(buffer->items,
+									  (buffer->nitems + tup->nitems) * sizeof(ItemPointerData));
 
-		new = ginMergeItemPointers(&buffer->items[buffer->nfrozen], /* first unfrozen */
-								   (buffer->nitems - buffer->nfrozen),	/* num of unfrozen */
-								   items, tup->nitems, &nnew);
+	new = ginMergeItemPointers(&buffer->items[buffer->nfrozen], /* first unfrozen */
+							   (buffer->nitems - buffer->nfrozen),	/* num of unfrozen */
+							   items, tup->nitems, &nnew);
 
-		Assert(nnew == (tup->nitems + (buffer->nitems - buffer->nfrozen)));
+	Assert(nnew == (tup->nitems + (buffer->nitems - buffer->nfrozen)));
 
-		memcpy(&buffer->items[buffer->nfrozen], new,
-			   nnew * sizeof(ItemPointerData));
+	memcpy(&buffer->items[buffer->nfrozen], new,
+		   nnew * sizeof(ItemPointerData));
 
-		pfree(new);
+	pfree(new);
 
-		buffer->nitems += tup->nitems;
+	buffer->nitems += tup->nitems;
 
-		AssertCheckItemPointers(buffer);
-	}
+	AssertCheckItemPointers(buffer);
 
 	/* free the decompressed TID list */
 	pfree(items);
-- 
2.51.0
