On Wed, 8 Jan 2025 at 12:32, Jeff Davis <pg...@j-davis.com> wrote:
> I committed the earlier cleanup patches and rebased the rest. Attached.
While I do understand the desire to reduce Hash Agg's memory usage, has
this really been through enough performance testing to be getting
committed?

I looked at the changes e0ece2a98 made to
LookupTupleHashEntry_internal() and saw that you're doing repalloc() on
the memory for the MinimalTuple just after the
ExecCopySlotMinimalTuple() pallocs that memory. Some of these tuples
might be quite big, and I see your test case has a single INT4 GROUP
BY, in which case the tuple is narrow. Remember this code is also used
for set operations, where the tuples to compare might be quite wide.

Since AllocSet rounds non-external chunks up to the next power-of-2,
many repallocs() won't require a new chunk as there's likely to be
space in the existing chunk, but I still seem to be able to measure
this even when the memcpy() to populate the new chunk from the old one
isn't needed.

For example:

create table a (a text not null);
insert into a select repeat(md5(a::text),10) from generate_Series(1,1000) a;
vacuum freeze analyze a;

groupbya.sql: select a,count(*) from a group by a;

psql -c "select pg_prewarm('a');" postgres > /dev/null && for i in {1..3}; do pgbench -n -f groupbya.sql -T 10 -M prepared postgres | grep tps; done

Before (34c6e6524):
tps = 1197.271513 (without initial connection time)
tps = 1201.798286 (without initial connection time)
tps = 1189.191958 (without initial connection time)

After (e0ece2a98):
tps = 1036.424401 (without initial connection time)
tps = 1094.528577 (without initial connection time)
tps = 1067.820026 (without initial connection time)

12% slower.
Or if I craft the tuple so that the repalloc() needs to switch from a
256 byte to a 512 byte chunk, I get:

create table b (b text not null);
insert into b select left(repeat(md5(b::text),8),236) from generate_Series(1,1000) b;
vacuum freeze analyze b;

groupbyb.sql: select b,count(*) from b group by b;

psql -c "select pg_prewarm('b');" postgres && for i in {1..3}; do pgbench -n -f groupbyb.sql -T 10 -M prepared postgres | grep tps; done

Before (34c6e6524):
tps = 1258.542072 (without initial connection time)
tps = 1260.423425 (without initial connection time)
tps = 1244.229721 (without initial connection time)

After (e0ece2a98):
tps = 1088.773442 (without initial connection time)
tps = 1068.253603 (without initial connection time)
tps = 1121.814669 (without initial connection time)

14% slower.

I understand that this likely helps when the hashtable is larger than
L3 and also when HashAgg is spilling to disk, but making those cases
faster at the expense of slowing down workloads that fit into memory
does not seem like a nice trade-off. I wonder if there's some better
way of doing this. Would it be worth having some function like
ExecCopySlotMinimalTuple() that accepts an additional parameter so that
the palloc allocates N more bytes at the end? Maybe it's worth hunting
around to see if there are any other executor nodes that could benefit
from that infrastructure.

The other problem the repalloc() creates is that, if you still want the
patch you posted in [1], the repalloc() can't be done, since bump
contexts don't allow it. It would be quite nice if there were some way
to have both of these optimisations without needing any repalloc().

David

[1] https://postgr.es/m/3d9b8f7609039c5775cc8272f09054faea794c66.ca...@j-davis.com