Hi,

While reviewing and testing a nearby patch (using COPY for batching in
postgres_fdw), I noticed some of the COPY queries are spending a
substantial amount of time in ResourceOwnerAddToHash(). The exact figure
depends on the amount of data in the COPY, but it was often close to 50%
(according to perf-record/report).

The reason is pretty simple - ResourceOwner tracks the resources in a
very simple hash table, which has O(n^2) behavior with duplicate
entries. This happens with COPY, because COPY creates an array of 1000
tuple slots, and each slot references the same tuple descriptor. And the
descriptor is added to ResourceOwner for each slot.

The trouble is that ResourceOwner uses a simple open-addressing hash
table, and the duplicate values end up in a long dense "train" of
entries, slowing down both additions and deletions. Starting with an
empty hash table, adding the first entry is fine - the first slot is
available. But each following addition needs to skip over the existing
entries - the n-th addition needs to skip over n-1 entries. With N
entries this means N * (N-1) / 2 probes in total. And then the same
thing happens for deletions, in reverse.

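For the COPY case, with the descriptor added once per slot, that's
roughly 1000 * 999 / 2 = ~500k probes just to add the entries for a
single batch, and about the same to remove them again.
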
This is not the first time I noticed ResourceOwner in profiles, but I
never investigated it. I don't know if all those cases were caused by
this same behavior, but it's likely at least some were ...

There's an easy way to improve this by allowing a single hash entry to
represent multiple references to the same resource. The attached patch
adds a "count" to the ResourceElem, tracking how many times that
resource was added. So if you add 1000 tuple slots, the descriptor will
have just one ResourceElem entry with count=1000.

This reduces the insert/delete complexity - not quite to O(1), but much
lower than O(n^2). It also reduces the memory usage with duplicates, and
postpones when the hash table needs to be enlarged. Of course, if there
are no duplicates, it'll use a bit more memory.
The deduplication is "opportunistic" in the sense that you may still end
up with multiple entries for the same value/kind. When adding a resource
finds an empty slot before hitting the other entry, it'll use the slot.
Also, only the hash table is deduplicated, not the initial array (which
however quite small).
This is a reasonable trade off because it means it does not get more
expensive for cases with no duplicates, while still improving cases with
many duplicate values.
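To make that concrete, here's the core of the insert logic with the
counter, simplified from the attached patch (asserts and stats omitted):

    idx = hash_resource_elem(value, kind) & mask;
    for (;;)
    {
        /* exact match - just bump the counter of the existing entry */
        if (owner->hash[idx].kind == kind &&
            owner->hash[idx].item == value)
        {
            owner->hash[idx].count += count;
            return;
        }

        /* a free slot comes first - use it (so dedup is opportunistic) */
        if (owner->hash[idx].kind == NULL)
            break;

        idx = (idx + 1) & mask;
    }

    owner->hash[idx].item = value;
    owner->hash[idx].kind = kind;
    owner->hash[idx].count = count;
    owner->nhash++;
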
To demonstrate the benefits, here's the throughput (tps) for a COPY of a
given number of rows from a file into an UNLOGGED table, with 1, 4 and 8
clients.

         |         master         |        patched
    rows |      1       4       8 |      1       4       8
---------|------------------------|-----------------------
      10 | 112090  418421  594905 | 112871  419043  589646
     100 |  35265  121903  168238 |  42536  138578  183740
    1000 |   1450    5700   10725 |   5847   21594   30713
   10000 |    531    1943    2988 |    743    2619    3557
  100000 |     72     244     324 |     76     256     332

Or, as relative throughput:

              patched / master
    rows |      1       4       8
---------|------------------------
      10 |   101%    100%     99%
     100 |   121%    114%    109%
    1000 |   403%    379%    286%
   10000 |   140%    135%    119%
  100000 |   106%    105%    102%

Those are some nice improvements, especially with 1000 rows. That makes
sense, because 1000 is the internal batch size in COPY - the overhead
gradually increases up to 1000 rows, and then it amortizes over many
batches, so it has very little impact on large COPY.

I've done the same test with a logged table. The results significantly
depend on the storage (how fast it can handle commits) - not surprising.
But with good storage the overall behavior is almost exactly the same.

I didn't do any benchmarking focused on cases with no/few duplicates yet
(although the COPY with 10 rows could qualify). But as I explained
earlier, the behavior should not change at all. There's an extra uint32
per entry, but that's it.
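(On a typical 64-bit build I'd expect that to grow ResourceElem from 16
to 24 bytes due to alignment padding, i.e. roughly 50% more memory per
tracked entry when nothing is deduplicated.)
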
regards
--
Tomas Vondra

From a8ef30e79b8cfc85d3c5fa393a11b28e304515f5 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <[email protected]>
Date: Thu, 16 Oct 2025 13:27:25 +0200
Subject: [PATCH v20251016] Deduplicate entries in ResourceOwner

We may track the same resource many times, but the current ResourceOwner
does not handle that efficiently. Each reference is a separate entry,
which leads to long linear searches in the simple hash table.

This adds a simple opportunistic deduplication. Entries get a counter,
tracking how many references they represent. It only applies to the
"hash" phase, not to the initial array. And when adding an entry to the
hash finds an empty slot first, it uses it (instead of continuing to
look for a possible match).
This means the deduplication is not perfect - there may be multiple
entries for the same value. But in practice it seems like a good trade
off. It won't affect cases with no duplicates, and it will help cases
with many of them.

The code also continues to grow the hash table as before. In some cases
it might be enough to deduplicate the existing contents instead, but
it's hard to say in advance, and it's not much cheaper - it still
requires walking the current hash table. So there's a risk of attempting
deduplication only to find the hash still needs to be enlarged, doubling
the cost. Hence we simply grow the hash table. In the worst case the
memory usage is no worse than without deduplication; if deduplication is
effective, the hash table grows later or not at all.
---
src/backend/utils/resowner/resowner.c | 55 ++++++++++++++++++++++++---
1 file changed, 49 insertions(+), 6 deletions(-)
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index fca84ded6dd..27127c99e40 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -60,10 +60,15 @@
*
* All objects managed by this code are required to fit into a Datum,
* which is fine since they are generally pointers or integers.
+ *
+ * We may see items multiple times (e.g. tuple descriptors when creating
+ * multiple tuple slots). To handle these cases efficiently we deduplicate
+ * the entries, and track the number of occurrences.
*/
typedef struct ResourceElem
{
Datum item;
+ uint32 count; /* number of occurrences */
const ResourceOwnerDesc *kind; /* NULL indicates a free hash table slot */
} ResourceElem;
@@ -198,7 +203,8 @@ static ResourceReleaseCallbackItem *ResourceRelease_callbacks = NULL;
/* Internal routines */
static inline uint32 hash_resource_elem(Datum value, const ResourceOwnerDesc *kind);
static void ResourceOwnerAddToHash(ResourceOwner owner, Datum value,
- const ResourceOwnerDesc *kind);
+ const ResourceOwnerDesc *kind,
+ uint32 count);
static int resource_priority_cmp(const void *a, const void *b);
static void ResourceOwnerSort(ResourceOwner owner);
static void ResourceOwnerReleaseAll(ResourceOwner owner,
@@ -239,7 +245,8 @@ hash_resource_elem(Datum value, const ResourceOwnerDesc *kind)
* Adds 'value' of given 'kind' to the ResourceOwner's hash table
*/
static void
-ResourceOwnerAddToHash(ResourceOwner owner, Datum value, const ResourceOwnerDesc *kind)
+ResourceOwnerAddToHash(ResourceOwner owner, Datum value, const ResourceOwnerDesc *kind,
+ uint32 count)
{
uint32 mask = owner->capacity - 1;
uint32 idx;
@@ -250,12 +257,21 @@ ResourceOwnerAddToHash(ResourceOwner owner, Datum value, const ResourceOwnerDesc
idx = hash_resource_elem(value, kind) & mask;
for (;;)
{
+ /* found an exact match - just increment the counter */
+ if ((owner->hash[idx].kind == kind) &&
+ (owner->hash[idx].item == value))
+ {
+ owner->hash[idx].count += count;
+ return;
+ }
+
if (owner->hash[idx].kind == NULL)
break; /* found a free slot */
idx = (idx + 1) & mask;
}
owner->hash[idx].item = value;
owner->hash[idx].kind = kind;
+ owner->hash[idx].count = count;
owner->nhash++;
}
@@ -377,6 +393,7 @@ ResourceOwnerReleaseAll(ResourceOwner owner, ResourceReleasePhase phase,
uint32 idx = nitems - 1;
Datum value = items[idx].item;
const ResourceOwnerDesc *kind = items[idx].kind;
+ uint32 count = items[idx].count;
if (kind->release_phase > phase)
break;
@@ -392,7 +409,14 @@ ResourceOwnerReleaseAll(ResourceOwner owner, ResourceReleasePhase phase,
elog(WARNING, "resource was not closed: %s", res_str);
pfree(res_str);
}
- kind->ReleaseResource(value);
+
+ /* invoke the release callback for each occurrence */
+ while (count > 0)
+ {
+ kind->ReleaseResource(value);
+ count--;
+ }
+
nitems--;
}
if (owner->nhash == 0)
@@ -496,7 +520,8 @@ ResourceOwnerEnlarge(ResourceOwner owner)
for (i = 0; i < oldcap; i++)
{
if (oldhash[i].kind != NULL)
- ResourceOwnerAddToHash(owner, oldhash[i].item, oldhash[i].kind);
+ ResourceOwnerAddToHash(owner, oldhash[i].item, oldhash[i].kind,
+ oldhash[i].count);
}
/* And release old hash table. */
@@ -506,7 +531,8 @@ ResourceOwnerEnlarge(ResourceOwner owner)
/* Move items from the array to the hash */
for (int i = 0; i < owner->narr; i++)
- ResourceOwnerAddToHash(owner, owner->arr[i].item, owner->arr[i].kind);
+ ResourceOwnerAddToHash(owner, owner->arr[i].item, owner->arr[i].kind,
+ owner->arr[i].count);
owner->narr = 0;
Assert(owner->nhash <= owner->grow_at);
@@ -543,6 +569,7 @@ ResourceOwnerRemember(ResourceOwner owner, Datum value, const ResourceOwnerDesc
idx = owner->narr;
owner->arr[idx].item = value;
owner->arr[idx].kind = kind;
+ owner->arr[idx].count = 1;
owner->narr++;
}
@@ -597,8 +624,17 @@ ResourceOwnerForget(ResourceOwner owner, Datum value, const ResourceOwnerDesc *k
if (owner->hash[idx].item == value &&
owner->hash[idx].kind == kind)
{
+ /* an entry representing multiple items, just decrement */
+ if (owner->hash[idx].count > 1)
+ {
+ owner->hash[idx].count--;
+ return;
+ }
+
+ /* remove the entry entirely */
owner->hash[idx].item = (Datum) 0;
owner->hash[idx].kind = NULL;
+ owner->hash[idx].count = 0;
owner->nhash--;
#ifdef RESOWNER_STATS
@@ -847,12 +883,19 @@ ResourceOwnerReleaseAllOfKind(ResourceOwner owner, const ResourceOwnerDesc *kind
if (owner->hash[i].kind == kind)
{
Datum value = owner->hash[i].item;
+ uint32 count = owner->hash[i].count;
owner->hash[i].item = (Datum) 0;
owner->hash[i].kind = NULL;
+ owner->hash[i].count = 0;
owner->nhash--;
- kind->ReleaseResource(value);
+ /* invoke the release callback once per tracked occurrence */
+ while (count > 0)
+ {
+ kind->ReleaseResource(value);
+ count--;
+ }
}
}
owner->releasing = false;
--
2.51.0