Hi,

While reviewing and testing a nearby patch (using COPY for batching in
postgres_fdw), I noticed some of the COPY queries are spending a
substantial amount of time in ResourceOwnerAddToHash(). The exact figure
depends on the amount of data in the COPY, but it was often close to 50%
(according to perf-record/report).

The reason is pretty simple - ResourceOwner tracks the resources in a
very simple hash table, which has O(n^2) behavior with duplicate
entries. This happens with COPY, because COPY creates an array of 1000
tuple slots, and each slot references the same tuple descriptor. And the
descriptor is added to ResourceOwner for each slot.

The trouble is that ResourceOwner uses a simple open-addressing hash
table, and the duplicate values end up in a long dense "train" of
entries, slowing down both additions and deletions. Starting with an
empty hash table, adding the first entry is fine - the first slot is
available. But each following addition needs to skip over the existing
entries - the n-th addition needs to skip over n-1 entries. With N
entries this means N * (N-1) / 2 probes in total. And then the same
thing happens for deletions, in reverse.

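For the COPY case, with the descriptor added once per slot, that's
roughly 1000 * 999 / 2 = ~500k probes just to add the entries for a
single batch, and about the same to remove them again.
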
This is not the first time I noticed ResourceOwner in profiles, but I
never investigated it. I don't know if all those cases were caused by
this same behavior, but it's likely at least some were ...

There's an easy way to improve this by allowing a single hash entry to
represent multiple references to the same resource. The attached patch
adds a "count" to the ResourceElem, tracking how many times that
resource was added. So if you add 1000 tuple slots, the descriptor will
have just one ResourceElem entry with count=1000.

This reduces the insert/delete complexity - not quite to O(1), but much
lower than O(n^2). It also reduces the memory usage with duplicates, and
postpones when the hash table needs to be enlarged. Of course, if there
are no duplicates, it'll use a bit more memory.
The deduplication is "opportunistic" in the sense that you may still end
up with multiple entries for the same value/kind. When adding a resource
finds an empty slot before hitting the other entry, it'll use the slot.
Also, only the hash table is deduplicated, not the initial array (which
however quite small).
This is a reasonable trade off because it means it does not get more
expensive for cases with no duplicates, while still improving cases with
many duplicate values.
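To make that concrete, here's the core of the insert logic with the
counter, simplified from the attached patch (asserts and stats omitted):

    idx = hash_resource_elem(value, kind) & mask;
    for (;;)
    {
        /* exact match - just bump the counter of the existing entry */
        if (owner->hash[idx].kind == kind &&
            owner->hash[idx].item == value)
        {
            owner->hash[idx].count += count;
            return;
        }

        /* a free slot comes first - use it (so dedup is opportunistic) */
        if (owner->hash[idx].kind == NULL)
            break;

        idx = (idx + 1) & mask;
    }

    owner->hash[idx].item = value;
    owner->hash[idx].kind = kind;
    owner->hash[idx].count = count;
    owner->nhash++;
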
To demonstrate the benefits, here's the throughput (tps) for a COPY of a
given number of rows from a file into an UNLOGGED table, with 1, 4 and 8
clients.

         |         master         |        patched
    rows |      1       4       8 |      1       4       8
---------|------------------------|-----------------------
      10 | 112090  418421  594905 | 112871  419043  589646
     100 |  35265  121903  168238 |  42536  138578  183740
    1000 |   1450    5700   10725 |   5847   21594   30713
   10000 |    531    1943    2988 |    743    2619    3557
  100000 |     72     244     324 |     76     256     332

Or, as relative throughput:

              patched / master
    rows |      1       4       8
---------|------------------------
      10 |   101%    100%     99%
     100 |   121%    114%    109%
    1000 |   403%    379%    286%
   10000 |   140%    135%    119%
  100000 |   106%    105%    102%

Those are some nice improvements, especially with 1000 rows. That makes
sense, because 1000 is the internal batch size in COPY - the overhead
gradually increases up to 1000 rows, and then it amortizes over many
batches, so it has very little impact on large COPY.

I've done the same test with a logged table. The results significantly
depend on the storage (how fast it can handle commits) - not surprising.
But with good storage the overall behavior is almost exactly the same.

I didn't do any benchmarking focused on cases with no/few duplicates yet
(although the COPY with 10 rows could qualify). But as I explained
earlier, the behavior should not change at all. There's an extra uint32
per entry, but that's it.
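(On a typical 64-bit build I'd expect that to grow ResourceElem from 16
to 24 bytes due to alignment padding, i.e. roughly 50% more memory per
tracked entry when nothing is deduplicated.)
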
regards
--
Tomas Vondra

From a8ef30e79b8cfc85d3c5fa393a11b28e304515f5 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <[email protected]>
Date: Thu, 16 Oct 2025 13:27:25 +0200
Subject: [PATCH v20251016] Deduplicate entries in ResourceOwner

We may track the same resource many times, but the current ResourceOwner
does not handle that efficiently. Each reference is a separate entry,
which leads to long linear searches in the simple hash table.

This adds a simple opportunistic deduplication. Entries get a counter,
tracking how many references they represent. It only applies to the
"hash" phase, not to the initial array. And when adding an entry to the
hash finds an empty slot first, it uses it (instead of continuing to
look for a possible match).
This means the deduplication is not perfect - there may be multiple
entries for the same value. But in practice it seems like a good trade
off. It won't affect cases with no duplicates, and it will help cases
with many of them.

The code also continues to grow the hash table as before. In some cases
it might be enough to deduplicate the existing contents instead, but
it's hard to say in advance, and it's not much cheaper - it still
requires walking the current hash table. So there's a risk of attempting
deduplication only to find the hash still needs to be enlarged, doubling
the cost. Hence we simply grow the hash table. In the worst case the
memory usage is no worse than without deduplication; if deduplication is
effective, the hash table grows later or not at all.
---
src/backend/utils/resowner/resowner.c | 55 ++++++++++++++++++++++++---
1 file changed, 49 insertions(+), 6 deletions(-)
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index fca84ded6dd..27127c99e40 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -60,10 +60,15 @@
*
* All objects managed by this code are required to fit into a Datum,
* which is fine since they are generally pointers or integers.
+ *
+ * We may see items multiple times (e.g. tuple descriptors when creating
+ * multiple tuple slots). To handle these cases efficiently we deduplicate
+ * the entries, and track the number of occurrences.
*/
typedef struct ResourceElem
{
Datum item;
+ uint32 count; /* number of occurrences */
const ResourceOwnerDesc *kind; /* NULL indicates a free hash table slot */
} ResourceElem;
@@ -198,7 +203,8 @@ static ResourceReleaseCallbackItem *ResourceRelease_callbacks = NULL;
/* Internal routines */
static inline uint32 hash_resource_elem(Datum value, const ResourceOwnerDesc *kind);
static void ResourceOwnerAddToHash(ResourceOwner owner, Datum value,
- const ResourceOwnerDesc *kind);
+ const ResourceOwnerDesc *kind,
+ uint32 count);
static int resource_priority_cmp(const void *a, const void *b);
static void ResourceOwnerSort(ResourceOwner owner);
static void ResourceOwnerReleaseAll(ResourceOwner owner,
@@ -239,7 +245,8 @@ hash_resource_elem(Datum value, const ResourceOwnerDesc *kind)
* Adds 'value' of given 'kind' to the ResourceOwner's hash table
*/
static void
-ResourceOwnerAddToHash(ResourceOwner owner, Datum value, const ResourceOwnerDesc *kind)
+ResourceOwnerAddToHash(ResourceOwner owner, Datum value, const ResourceOwnerDesc *kind,
+ uint32 count)
{
uint32 mask = owner->capacity - 1;
uint32 idx;
@@ -250,12 +257,21 @@ ResourceOwnerAddToHash(ResourceOwner owner, Datum value, const ResourceOwnerDesc
idx = hash_resource_elem(value, kind) & mask;
for (;;)
{
+ /* found an exact match - just increment the counter */
+ if ((owner->hash[idx].kind == kind) &&
+ (owner->hash[idx].item == value))
+ {
+ owner->hash[idx].count += count;
+ return;
+ }
+
if (owner->hash[idx].kind == NULL)
break; /* found a free slot */
idx = (idx + 1) & mask;
}
owner->hash[idx].item = value;
owner->hash[idx].kind = kind;
+ owner->hash[idx].count = count;
owner->nhash++;
}
@@ -377,6 +393,7 @@ ResourceOwnerReleaseAll(ResourceOwner owner, ResourceReleasePhase phase,
uint32 idx = nitems - 1;
Datum value = items[idx].item;
const ResourceOwnerDesc *kind = items[idx].kind;
+ uint32 count = items[idx].count;
if (kind->release_phase > phase)
break;
@@ -392,7 +409,14 @@ ResourceOwnerReleaseAll(ResourceOwner owner, ResourceReleasePhase phase,
elog(WARNING, "resource was not closed: %s", res_str);
pfree(res_str);
}
- kind->ReleaseResource(value);
+
+ /* invoke the release callback for each occurrence */
+ while (count > 0)
+ {
+ kind->ReleaseResource(value);
+ count--;
+ }
+
nitems--;
}
if (owner->nhash == 0)
@@ -496,7 +520,8 @@ ResourceOwnerEnlarge(ResourceOwner owner)
for (i = 0; i < oldcap; i++)
{
if (oldhash[i].kind != NULL)
- ResourceOwnerAddToHash(owner, oldhash[i].item, oldhash[i].kind);
+ ResourceOwnerAddToHash(owner, oldhash[i].item, oldhash[i].kind,
+ oldhash[i].count);
}
/* And release old hash table. */
@@ -506,7 +531,8 @@ ResourceOwnerEnlarge(ResourceOwner owner)
/* Move items from the array to the hash */
for (int i = 0; i < owner->narr; i++)
- ResourceOwnerAddToHash(owner, owner->arr[i].item, owner->arr[i].kind);
+ ResourceOwnerAddToHash(owner, owner->arr[i].item, owner->arr[i].kind,
+ owner->arr[i].count);
owner->narr = 0;
Assert(owner->nhash <= owner->grow_at);
@@ -543,6 +569,7 @@ ResourceOwnerRemember(ResourceOwner owner, Datum value, const ResourceOwnerDesc
idx = owner->narr;
owner->arr[idx].item = value;
owner->arr[idx].kind = kind;
+ owner->arr[idx].count = 1;
owner->narr++;
}
@@ -597,8 +624,17 @@ ResourceOwnerForget(ResourceOwner owner, Datum value, const ResourceOwnerDesc *k
if (owner->hash[idx].item == value &&
owner->hash[idx].kind == kind)
{
+ /* an entry representing multiple items, just decrement */
+ if (owner->hash[idx].count > 1)
+ {
+ owner->hash[idx].count--;
+ return;
+ }
+
+ /* remove the entry entirely */
owner->hash[idx].item = (Datum) 0;
owner->hash[idx].kind = NULL;
+ owner->hash[idx].count = 0;
owner->nhash--;
#ifdef RESOWNER_STATS
@@ -847,12 +883,19 @@ ResourceOwnerReleaseAllOfKind(ResourceOwner owner, const ResourceOwnerDesc *kind
if (owner->hash[i].kind == kind)
{
Datum value = owner->hash[i].item;
+ uint32 count = owner->hash[i].count;
owner->hash[i].item = (Datum) 0;
owner->hash[i].kind = NULL;
+ owner->hash[i].count = 0;
owner->nhash--;
- kind->ReleaseResource(value);
+ /* invoke the release callback once per tracked occurrence */
+ while (count > 0)
+ {
+ kind->ReleaseResource(value);
+ count--;
+ }
}
}
owner->releasing = false;
--
2.51.0