On Fri, Oct 9, 2020 at 8:41 PM Amit Kapila <amit.kapil...@gmail.com> wrote:
>
> On Fri, Oct 9, 2020 at 2:37 PM Greg Nancarrow <gregn4...@gmail.com> wrote:
> >
> > Speaking of costing, I'm not sure I really agree with the current
> > costing of a Gather node. Just considering a simple Parallel SeqScan
> > case, the "run_cost += parallel_tuple_cost * path->path.rows;" part of
> > Gather cost always completely drowns out any other path costs when a
> > large number of rows are involved (at least with default
> > parallel-related GUC values), such that Parallel SeqScan would never
> > be the cheapest path. This linear relationship in the costing based on
> > the rows and a parallel_tuple_cost doesn't make sense to me. Surely
> > after a certain amount of rows, the overhead of launching workers will
> > be out-weighed by the benefit of their parallel work, such that the
> > more rows, the more likely a Parallel SeqScan will benefit.
> >
>
> That will be true for the number of rows/pages we need to scan not for
> the number of tuples we need to return as a result. The formula here
> considers the number of rows the parallel scan will return and the
> more the number of rows each parallel node needs to pass via shared
> memory to gather node the more costly it will be.
>
> We do consider the total pages we need to scan in
> compute_parallel_worker() where we use a logarithmic formula to
> determine the number of workers.
>
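Just to make the shape of the issue concrete, here is a small stand-alone
sketch (an illustration only, not PostgreSQL source; it simply re-applies
the formula quoted above using the default values of the
parallel_setup_cost and parallel_tuple_cost GUCs) showing how much cost
the Gather node adds on top of its subpath as the number of returned
rows grows:

#include <stdio.h>

/* Default values of the relevant cost GUCs. */
static const double parallel_setup_cost = 1000.0;
static const double parallel_tuple_cost = 0.1;

/*
 * Extra cost the Gather node adds on top of its subpath's cost, per the
 * formula quoted above:
 *     startup_cost += parallel_setup_cost;
 *     run_cost     += parallel_tuple_cost * path->path.rows;
 */
static double gather_overhead(double rows)
{
    return parallel_setup_cost + parallel_tuple_cost * rows;
}

int main(void)
{
    static const double rows[] = {100000, 1000000, 10000000, 100000000};

    for (int i = 0; i < 4; i++)
        printf("rows = %12.0f  ->  Gather overhead = %12.0f\n",
               rows[i], gather_overhead(rows[i]));
    return 0;
}

With the defaults, returning say 100 million rows through a Gather adds
roughly 10,000,000 to the plan's cost, growing linearly with the row
count, regardless of how much the workers save on the underlying scan.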
Despite all the best intentions, the current costing seems geared towards
selecting a non-parallel plan over a parallel plan the more rows there are
in the table, yet the actual performance of a parallel plan tends to beat
the non-parallel plan the more rows there are in the table. That doesn't
seem right to me. Is there a rationale behind this costing model?

I have pointed out the part of the calculation, the parallel_tuple_cost
term, that seems to drown out all other costs (making the Gather cost
huge) as the number of rows in the table grows.

Regards,
Greg Nancarrow
Fujitsu Australia