On Wed, 8 Jan 2025 at 12:32, Jeff Davis <pg...@j-davis.com> wrote:
> I committed the earlier cleanup patches and rebased the rest. Attached.
While I do understand the desire to reduce Hash Agg's memory usage, has
this really been through enough performance testing to be getting
committed?

I looked at the changes e0ece2a98 made to
LookupTupleHashEntry_internal() and saw that you're doing repalloc() on
the memory for the MinimalTuple just after the
ExecCopySlotMinimalTuple() pallocs that memory. Some of these tuples
might be quite big, and I see your test case has a single INT4 GROUP
BY, in which case the tuple is narrow. Remember this code is also used
for set operations, where the tuples to compare might be quite wide.

Since AllocSet rounds non-external chunks up to the next power-of-2,
many repallocs() won't require a new chunk as there's likely to be
space in the existing chunk, but I still seem to be able to measure
this even when the memcpy() to populate the new chunk from the old one
isn't needed.

For example:

create table a (a text not null);
insert into a select repeat(md5(a::text),10) from generate_Series(1,1000) a;
vacuum freeze analyze a;

groupbya.sql: select a,count(*) from a group by a;

psql -c "select pg_prewarm('a');" postgres > /dev/null && for i in {1..3}; do pgbench -n -f groupbya.sql -T 10 -M prepared postgres | grep tps; done

Before (34c6e6524):
tps = 1197.271513 (without initial connection time)
tps = 1201.798286 (without initial connection time)
tps = 1189.191958 (without initial connection time)

After (e0ece2a98):
tps = 1036.424401 (without initial connection time)
tps = 1094.528577 (without initial connection time)
tps = 1067.820026 (without initial connection time)

12% slower.
Or if I craft the tuple so that the repalloc() needs to switch from a
256 byte to a 512 byte chunk, I get:

create table b (b text not null);
insert into b select left(repeat(md5(b::text),8),236) from generate_Series(1,1000) b;
vacuum freeze analyze b;

groupbyb.sql: select b,count(*) from b group by b;

psql -c "select pg_prewarm('b');" postgres && for i in {1..3}; do pgbench -n -f groupbyb.sql -T 10 -M prepared postgres | grep tps; done

Before (34c6e6524):
tps = 1258.542072 (without initial connection time)
tps = 1260.423425 (without initial connection time)
tps = 1244.229721 (without initial connection time)

After (e0ece2a98):
tps = 1088.773442 (without initial connection time)
tps = 1068.253603 (without initial connection time)
tps = 1121.814669 (without initial connection time)

14% slower.

I understand that this likely helps when the hashtable is larger than
L3 and also when HashAgg is spilling to disk, but making those cases
faster at the expense of slowing down workloads that fit into memory
does not seem like a nice trade-off. I wonder if there's some better
way of doing this. Would it be worth having some function like
ExecCopySlotMinimalTuple() that accepts an additional parameter so that
the palloc allocates N more bytes at the end? Maybe it's worth hunting
around to see if there are any other executor nodes that could benefit
from that infrastructure.

The other problem the repalloc() creates is that, if you still want the
patch you posted in [1], the repalloc() can't be done, since bump
contexts don't allow it. It would be quite nice if there were some way
to have both of these optimisations without needing any repalloc().

David

[1] https://postgr.es/m/3d9b8f7609039c5775cc8272f09054faea794c66.ca...@j-davis.com