Hey Thomas, On Fri, Jul 10, 2020 at 7:30 PM Thomas Munro <thomas.mu...@gmail.com> wrote: > > > I could reproduce the speed gain that you saw for a plan with a simple > > parallel sequential scan. However, I got no gain at all for a parallel > > hash join and parallel agg query. > > Right, it's not going to make a difference when you only send one > tuple through the queue, like COUNT(*) does.
How silly of me! I should have paid more attention to the rows output from each worker and that there was a select count(*) on the join query. Anyway, these are a new set of results: ----------------------------------------------------------------------- select pg_prewarm('lineitem'); select pg_prewarm('orders'); -- lineitem is 17G, orders is 4G. (TPCH scale = 20). shared_buffers = 30G explain analyze select * from lineitem join orders on l_orderkey = o_orderkey where o_totalprice > 5.00; [w/o any patch] 637s [w/ first patch] 635s [w/ last minimal tuple patch] 568s ----------------------------------------------------------------------- We do indeed get the speedup. > > As for gather merge, is it possible to have a situation where the slot > > input to tqueueReceiveSlot() is a heap slot (as would be the case for a > > simple select *)? If yes, in those scenarios, we would be incurring an > > extra call to minimal_tuple_from_heap_tuple() because of the extra call > > to ExecFetchSlotMinimalTuple() inside tqueueReceiveSlot() in your patch. > > And since, in a gather merge, we can't avoid the copy on the leader side > > (heap_copy_minimal_tuple() inside gm_readnext_tuple()), we would be > > doing extra work in that scenario. I couldn't come up with a plan that > > creates a scenario like this however. > > Hmm. I wish we had a way to do an "in-place" copy-to-minimal-tuple > where the caller supplies the memory, with some fast protocol to get > the size right. We could use that for copying tuples into shm queues, > hash join tables etc without an extra palloc()/pfree() and double > copy. Do you mean that we should have an implementation for get_minimal_tuple() for the heap AM and have it return a pointer to the minimal tuple from the MINIMAL_TUPLE_OFFSET? And then a caller such as tqueueReceiveSlot() will ensure that the heap tuple from which it wants to extract the minimal tuple was allocated in the tuple queue in the first place? If we consider that the node directly below a gather is a SeqScan, then we could possibly, in ExecInitSeqScan() set-up the ss_ScanTupleSlot to point to memory in the shared tuple queue? Similarly, other ExecInit*() methods can do the same for other executor nodes that involve parallelism? Of course, things would be slightly different for the other use cases you mentioned (such as hash table population) All things considered, I think the patch in its current form should go in. Having the in-place copy, could be done as a separate patch? Do you agree? Regards, Soumyadeep (VMware)