Robert Haas <robertmh...@gmail.com> writes: > I am not sure whether this is actually correct, but it seems a lot > more believable than the previous proposals. The problem might be more > general, though. I think when I developed this parallel query stuff I > modeled a lot of it on what you did for parameterized paths. Both > parameterized paths and parallelism can create situations where > executing a path to completion produces fewer rows than you would > otherwise get. In the case of parameterized paths, this happens > because we enforce the parameterization we've chosen on top of the > user-supplied quals. In the case of parallelism, it happens because > the rows are split up across the different workers. I think I intended > that the "rows" field of RelOptInfo should be the row count for the > relation in total, and that the "rows" field of the Path should be the > number of rows we expect to get for one execution of the path. But it > seems like this problem is good evidence that I didn't find all the > places that need to be adjusted for parallelism, and I wouldn't be > very surprised if there are a bunch of others that I overlooked.
I did look at the rest of costsize.c for similar instances, and didn't find any. In any case, I think we have two options: 1. Apply this fix, and in future fix any other places that we identify later. 2. Invent some entirely new scheme that we hope is less mistake-prone. Option #2 is unlikely to lead to any near-term fix, and we certainly wouldn't dare back-patch it. > It's not actually very nice that we end up having to call > clauselist_selectivity() here. We've already called > set_baserel_size_estimates() to figure out how many rows we expect to > have been filtered out by the quals, and it sucks to have to do it > again. Brainstorming wildly and maybe stupidly, I wonder if the whole > model is wrong here. Maybe a path shouldn't have a row count; instead, > maybe it should have a multiplier that it applies to the relation's > row count. Then, if X is parameterized in the same way as its subpath > Y, we can just copy the multiplier up, but now it will be applied to > the new rel's "rows" value, which will have already been adjusted > appropriately by set_baserel_size_estimates(). I've wondered about that too, but it seems to depend on the assumption that clauses are estimated independently by clauselist_selectivity, which has not been true for a long time (and is getting less true not more so). So we could possibly apply something like this for parallelism, but not for parameterized paths, and that makes it less appealing ... IMO anyway. I have thought it might be good to explicitly mark partial paths with the estimated number of workers, which would be effectively the same thing as what you're talking about. But I wonder if we'd not still be better off keeping the path rowcount as being number-of-rows-in-each-worker, and just scale it up by the multiplier for EXPLAIN output. (And then also print the true total number of rows in EXPLAIN ANALYZE.) If we do the inverse of that, then we risk bugs from failing to correct the rowcount during cost-estimation calculations. I kinda feel that the bottom line here is that cost estimation is hard, and we're not going to find a magic bullet that removes bugs. regards, tom lane