On Sat, Aug 28, 2021 at 4:29 AM Zhihong Yu <z...@yugabyte.com> wrote:
>
> On Sat, Aug 28, 2021 at 12:11 AM Dilip Kumar <dilipbal...@gmail.com> wrote:
>
>> On Tue, Aug 24, 2021 at 8:48 PM Robert Haas <robertmh...@gmail.com> wrote:
>>
>>> On Fri, Aug 6, 2021 at 4:31 AM Dilip Kumar <dilipbal...@gmail.com> wrote:
>>> > Results: (query EXPLAIN ANALYZE SELECT * FROM t;)
>>> > 1) Non-parallel (default)
>>> > Execution Time: 31627.492 ms
>>> >
>>> > 2) Parallel with 4 workers (forced by setting parallel_tuple_cost to 0)
>>> > Execution Time: 37498.672 ms
>>> >
>>> > 3) Same as above (2) but with the patch.
>>> > Execution Time: 23649.287 ms
>>>
>>> This strikes me as an amazingly good result. I guess before seeing
>>> these results, I would have said that you can't reasonably expect
>>> parallel query to win on a query like this because there isn't enough
>>> for the workers to do. It's not like they are spending time evaluating
>>> filter conditions or anything like that - they're just fetching tuples
>>> off of disk pages and sticking them into a queue. And it's unclear to
>>> me why it should be better to have a bunch of processes doing that
>>> instead of just one. I would have thought, looking at just (1) and
>>> (2), that parallelism gained nothing and communication overhead lost
>>> 6 seconds.
>>>
>>> But what this suggests is that parallelism gained at least 8 seconds,
>>> and communication overhead lost at least 14 seconds. In fact...
>>
>> Right, good observation.
>>
>>> > - If I apply both Experiment#1 and Experiment#2 patches together,
>>> > then we can further reduce the execution time to 20963.539 ms (with
>>> > 4 workers and a 4MB tuple queue size)
>>>
>>> ...this suggests that parallelism actually gained at least 10-11
>>> seconds, and the communication overhead lost at least 15-16 seconds.
>>
>> Yes.
>>
>>> If that's accurate, it's pretty crazy. We might need to drastically
>>> reduce the value of parallel_tuple_cost if these results hold up and
>>> this patch gets committed.
>>
>> In one of my experiments [Test1] I noticed that even on head the forced
>> parallel plan is significantly faster than the non-parallel plan, and
>> with the patch it is even better. The point is that there may already be
>> cases where the forced parallel plan is faster, but we cannot be sure
>> whether we can reduce parallel_tuple_cost for them. With the patch,
>> however, the parallel tuple queue is definitely faster than what we have
>> now, so I agree we should consider reducing parallel_tuple_cost once this
>> patch is committed.
>>
>> Additionally, I've done some more experiments with artificial workloads,
>> as well as workloads where the parallel plan is selected by default, and
>> in all cases I've seen a significant improvement. The gain is directly
>> proportional to the load on the tuple queue, as expected.
>>
>> Test1: (Workers return all tuples, but only a few tuples are returned to
>> the client)
>> ----------------------------------------------------
>> INSERT INTO t SELECT i%10, repeat('a', 200) from
>> generate_series(1,200000000) as i;
>> set max_parallel_workers_per_gather=4;
>>
>> Target Query: SELECT random() FROM t GROUP BY a;
>>
>> Non-parallel (default plan): 77170.421 ms
>> Parallel (parallel_tuple_cost=0): 53794.324 ms
>> Parallel with patch (parallel_tuple_cost=0): 42567.850 ms
>>
>> 20% gain compared to the forced parallel plan, 45% gain compared to the
>> default plan.
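
(As an aside, for anyone wanting to reproduce Test1 above, here is a minimal
setup sketch. The CREATE TABLE is an assumption on my part, since the thread
only shows the INSERT; the GUC settings are the ones used in the test.)

    -- Assumed table definition (only the INSERT appears in the thread;
    -- inferred from the integer and repeat('a', 200) columns):
    CREATE TABLE t (a int, b text);

    INSERT INTO t SELECT i % 10, repeat('a', 200)
      FROM generate_series(1, 200000000) AS i;

    -- Force the parallel plan, as in the "parallel_tuple_cost=0" rows above:
    SET max_parallel_workers_per_gather = 4;
    SET parallel_tuple_cost = 0;

    -- Timing can be taken with EXPLAIN ANALYZE, as in the earlier test:
    EXPLAIN ANALYZE SELECT random() FROM t GROUP BY a;
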
>>
>> Test2: (Parallel case with default parallel_tuple_cost)
>> ----------------------------------------------
>> INSERT INTO t SELECT i, repeat('a', 200) from
>> generate_series(1,200000000) as i;
>>
>> set max_parallel_workers_per_gather=4;
>> SELECT * from t WHERE a < 17500000;
>> Parallel (default plan): 23730.054 ms
>> Parallel with patch (default plan): 21614.251 ms
>>
>> 8 to 10% gain compared to the default parallel plan.
>>
>> I have done some cleanup in the patch and will add it to the September
>> commitfest.
>>
>> I am planning to do further testing to identify the optimal batch size
>> for different workloads. With the above workload I am seeing similar
>> results with batch sizes from 4k to 16k (1/4 of the ring size), so in
>> the attached patch I have kept it at 1/4 of the ring size. We might
>> change that based on more analysis and testing.
>>
>> --
>> Regards,
>> Dilip Kumar
>> EnterpriseDB: http://www.enterprisedb.com
>
> Hi,
> Some minor comments.
>
> For shm_mq.c, the existing comment says:
>
> * mqh_partial_bytes, mqh_expected_bytes, and mqh_length_word_complete
>
> +   Size        mqh_send_pending;
>     bool        mqh_length_word_complete;
>     bool        mqh_counterparty_attached;
>
> I wonder if mqh_send_pending should be declared after
> mqh_length_word_complete - this way, the order of the fields matches the
> order in which they are explained.
>
> +   if (mqh->mqh_send_pending > mq->mq_ring_size / 4 || force_flush)
>
> The above can be written as:
>
> +   if (force_flush || mqh->mqh_send_pending > (mq->mq_ring_size >> 1))
>
> so that when force_flush is true, the other condition is not evaluated.
>
> Cheers

There was a typo in the suggested code above. It should be:

+   if (force_flush || mqh->mqh_send_pending > (mq->mq_ring_size >> 2))

Cheers
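
To make the suggestion concrete, here is a small self-contained sketch of the
flush decision. Only force_flush, mqh_send_pending, and mq_ring_size are taken
from the thread; the toy_mq/toy_mq_handle types below are simplified stand-ins
for illustration, not the real shm_mq structures in shm_mq.c.

    #include <stdbool.h>
    #include <stddef.h>

    typedef size_t Size;

    typedef struct toy_mq
    {
        Size    mq_ring_size;       /* total size of the shared ring buffer */
    } toy_mq;

    typedef struct toy_mq_handle
    {
        toy_mq *mqh_queue;          /* queue this handle writes to */
        Size    mqh_send_pending;   /* bytes written locally, not yet flushed */
    } toy_mq_handle;

    static bool
    should_flush(toy_mq_handle *mqh, bool force_flush)
    {
        toy_mq *mq = mqh->mqh_queue;

        /*
         * Test force_flush first so that, thanks to short-circuit
         * evaluation, the size comparison is skipped entirely when a flush
         * is forced.  ">> 2" is the corrected form of the 1/4-of-ring-size
         * threshold ("/ 4" in the original patch).
         */
        return force_flush || mqh->mqh_send_pending > (mq->mq_ring_size >> 2);
    }

Since mq_ring_size is unsigned, mq_ring_size / 4 and mq_ring_size >> 2 compute
the same threshold; the reordering only changes the evaluation order, not the
batching behaviour.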