On Fri, Dec 20, 2019 at 1:58 PM Tom Lane <t...@sss.pgh.pa.us> wrote:

> Jeff Janes <jeff.ja...@gmail.com> writes:
> > The docs for parallel_tuple_cost are quite terse, as the reference
> > section of the docs usually are:
> > "Sets the planner's estimate of the cost of transferring one tuple from a
> > parallel worker process to another process. The default is 0.1."
>
> > If you were take the doc description literally, then the default value
> > seems much too high, as it doesn't take 10x the (default) cpu_tuple_cost
> > to transfer a tuple up from a parallel worker.
>
> Really?  If anything, I'd have thought it might be worse than 10x.
> Cross-process communication isn't cheap, at least not according to
> my instincts.
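(For reference, the cost settings being compared can be read straight out of pg_settings; the query below is only a sanity check, and on a stock install it returns the documented defaults, i.e. cpu_tuple_cost 0.01, parallel_tuple_cost 0.1, parallel_setup_cost 1000.)

select name, setting
from pg_settings
where name in ('cpu_tuple_cost', 'parallel_tuple_cost', 'parallel_setup_cost');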
I was a bit surprised. I set it up so that there was a fine-tunable filter which can be applied in the parallel workers, and then only the surviving tuples get passed up to the leader. The use of a parallel seq scan didn't become slower than the non-parallel version until over 95% of the tuples were surviving the filter. If I wanted to make the estimated cost cross-over point match the measured time cross-over point, I had to mark parallel_tuple_cost down to about 0.011. This was an 8-CPU machine, an AWS m5.4xlarge, with max_parallel_workers_per_gather=7. (On my crummy 2-CPU Windows 10 laptop running Ubuntu via VirtualBox, the cross-over point was closer to 40% of the tuples surviving, and the parallel_tuple_cost to match the cross-over point would be about 0.016, but I don't have enough RAM to make a large enough all-in-shared-buffers table to really get a good assessment.)

My method was to make shared_buffers a large fraction of RAM (55GB out of 64GB), then make a table slightly smaller than that and force it into shared_buffers with pg_prewarm. I set seq_page_cost = random_page_cost = 0, to accurately reflect the fact that no IO is occurring.

create table para_seq as
  select floor(random()*10000)::int as id,
         random() as x,
         md5(random()::text)||md5(random()::text) as t
  from generate_series(1,8000000*55);
vacuum analyze para_seq;
select pg_prewarm('para_seq');

explain (analyze, buffers, settings, timing off)
  select * from para_seq where id<9500;

You can change the 9500 to tune the selectivity of the filter (the P.S. below sketches the parallel vs. non-parallel comparison at one cut-off).

Is this the correct way to try to isolate just the overhead of transferring a tuple away from other considerations, so it can be measured? I don't think the fact that EXPLAIN ANALYZE throws away the result set without reading it should change anything. Reading it would add the same fixed overhead to both the parallel and non-parallel cases, so it would dilute the percentage difference without changing the absolute differences.

I tried it with wider tuples as well, but not so wide that they would activate TOAST, and didn't really see a difference in the conclusion.

> > On the other hand, you
> > probably don't want a query which consumes 8x the CPU resources just to
> > finish only 5% faster (on an otherwise idle server with 8 CPUs). Maybe
> > this Amdahl factor is what inspired the high default value?
>
> I think the large value of parallel_setup_cost is what's meant to
> discourage that scenario.

I think that can only account for overhead like forking and setting up memory segments. The overhead of moving tuples around (beyond what single-threaded execution already moves them) would need to scale with the number of tuples moved.

Cheers,

Jeff
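P.S. For anyone who wants to reproduce the comparison, the runs at a single cut-off went along the lines below. This is a sketch rather than the exact script I ran; in particular, forcing the non-parallel baseline by setting max_parallel_workers_per_gather = 0 for that run is just one way to get it. Timings come from the "Execution Time" reported by each EXPLAIN ANALYZE.

-- no IO is happening, so take page costs out of the picture
set seq_page_cost = 0;
set random_page_cost = 0;

-- non-parallel baseline
set max_parallel_workers_per_gather = 0;
explain (analyze, buffers, settings, timing off)
  select * from para_seq where id < 9500;

-- parallel run (7 workers on the 8-CPU machine)
set max_parallel_workers_per_gather = 7;
explain (analyze, buffers, settings, timing off)
  select * from para_seq where id < 9500;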