Hi Thomas, +1 to the idea! I ran some experiments on both of your patches.
I could reproduce the speed gain you saw for a plan with a simple parallel
sequential scan. However, I got no gain at all for a parallel hash join or a
parallel aggregate query.

-----------------------------------------------------------------------
select pg_prewarm('lineitem');
-- lineitem is 17G. (TPCH scale = 20). shared_buffers = 30G

explain analyze select * from lineitem;

[w/o any patch]                99s
[w/ first patch]               89s
[w/ last minimal tuple patch]  79s
-----------------------------------------------------------------------
select pg_prewarm('lineitem');
-- lineitem is 17G. (TPCH scale = 20). shared_buffers = 30G

explain analyze select count(*) from lineitem;

[w/o any patch]                10s
[w/ first patch]               10s
[w/ last minimal tuple patch]  10s
-----------------------------------------------------------------------
select pg_prewarm('lineitem');
select pg_prewarm('orders');
-- lineitem is 17G, orders is 4G. (TPCH scale = 20). shared_buffers = 30G

explain analyze select count(*) from lineitem
  join orders on l_orderkey = o_orderkey
  where o_totalprice > 5.00;

[w/o any patch]                54s
[w/ first patch]               53s
[w/ last minimal tuple patch]  56s
-----------------------------------------------------------------------

Maybe I'm missing something: shouldn't we see an improvement for any plan
that has a Gather, since every such plan funnels its tuples through the
tuple queue?

As for Gather Merge: is it possible to have a situation where the slot
passed to tqueueReceiveSlot() is a heap slot (as would be the case for a
simple select *)? If so, in that scenario we would incur an extra call to
minimal_tuple_from_heap_tuple(), because of the extra call to
ExecFetchSlotMinimalTuple() inside tqueueReceiveSlot() in your patch. And
since, with Gather Merge, we can't avoid the copy on the leader side
(heap_copy_minimal_tuple() inside gm_readnext_tuple()), we would be doing
extra work in that scenario. I couldn't come up with a plan that actually
creates such a scenario, however. (See the rough sketch at the end of this
mail for the path I mean.)

Regards,
Soumyadeep (VMware)
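
P.S. Here is the rough sketch I mentioned above. It is not your patch, just
my understanding of what the worker-side send path looks like once
tqueueReceiveSlot() asks the slot for a minimal tuple; the struct name,
includes and surrounding boilerplate are paraphrased from memory of master,
so the exact code in your patch may well differ.

#include "postgres.h"

#include "access/htup_details.h"
#include "executor/tuptable.h"
#include "storage/shm_mq.h"

static bool
tqueueReceiveSlot(TupleTableSlot *slot, DestReceiver *self)
{
	TQueueDestReceiver *tqueue = (TQueueDestReceiver *) self;
	MinimalTuple tuple;
	shm_mq_result result;
	bool		should_free;

	/*
	 * If 'slot' is a heap/buffer slot (e.g. a base-table scan feeding a
	 * plain "select *"), the slot can't just hand back a pointer here: it
	 * has to build a fresh minimal tuple, i.e. an extra
	 * minimal_tuple_from_heap_tuple() copy in the worker.
	 */
	tuple = ExecFetchSlotMinimalTuple(slot, &should_free);

	/* Send the tuple itself. */
	result = shm_mq_send(tqueue->queue, tuple->t_len, tuple, false);

	if (should_free)
		heap_free_minimal_tuple(tuple);

	/* ... error handling elided ... */
	return (result == SHM_MQ_SUCCESS);
}

On the leader side, Gather Merge would then still do its own
heap_copy_minimal_tuple() in gm_readnext_tuple(), which is why this looked
to me like a possible net loss in that (possibly hypothetical) heap-slot
case.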