On Mon, Oct 29, 2018 at 1:44 AM David Rowley
<david.row...@2ndquadrant.com> wrote:
>
> On 28 October 2018 at 03:49, Julien Rouhaud <rjuju...@gmail.com> wrote:
> > I just had a look at your patch.  I see that you implemented only a
> > subset of the possible optimizations (only the case for range
> > partitioning without subpartitions).  This has been previously
> > discussed, but we should be able to do a similar optimization for
> > list partitioning if there are no interleaved values, and also for
> > some cases of multi-level partitioning.
>
> I had thought about these cases but originally had thought they would
> be more complex to implement than I could justify.  On review, I've
> found some pretty cheap ways to handle both sub-partitions and LIST
> partitioned tables.  Currently, with LIST partitioned tables I've
> coded it to only allow the optimisation if there's no DEFAULT
> partition and all partitions are defined with exactly 1 Datum.  This
> guarantees that there are no interleaved values, but it'll just fail
> to optimise cases like FOR VALUES IN (1, 2) + FOR VALUES IN (3, 4).
> The reason that I didn't go to the trouble of the additional checks
> was that I don't really want to add any per-partition overhead to
> this.

I see, but the overhead you mention is because you're doing that check
during planning, in build_partition_pathkeys().  As advised by Robert
quite some time ago
(https://www.postgresql.org/message-id/CA+TgmobOWgT1=zyjx-q=7s8akxnodix46qg0_-yx7k369p6...@mail.gmail.com),
we can store that information when the PartitionDesc is built, so it
wouldn't be problematic.  Since checking for overlapping values is
straightforward with the BoundInfoData infrastructure, it'd be a pity
to miss this optimization in such cases, which I believe would not be
rare.
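To make sure we're talking about the same kind of case, here's a
minimal sketch (the table and partition names are just made up for
this example):

-- Multi-valued but non-interleaved list partitions: every value of
-- t_12 sorts before every value of t_34, so an ordered Append on (a)
-- would still be valid even though each partition holds more than one
-- Datum.  The current single-Datum check would miss this case.
CREATE TABLE t (a int) PARTITION BY LIST (a);
CREATE TABLE t_12 PARTITION OF t FOR VALUES IN (1, 2);
CREATE TABLE t_34 PARTITION OF t FOR VALUES IN (3, 4);

-- Interleaved values: the two partitions alternate in sort order, so
-- this one has to stay excluded from the optimization.
CREATE TABLE t2 (a int) PARTITION BY LIST (a);
CREATE TABLE t2_13 PARTITION OF t2 FOR VALUES IN (1, 3);
CREATE TABLE t2_24 PARTITION OF t2 FOR VALUES IN (2, 4);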
> If RelOptInfo had a Bitmapset of live partitions then we could just
> check the partitions that survived pruning.  Amit Langote has a
> pending patch which does that and some other useful stuff, so maybe
> we can delay fixing that until the dust settles a bit in that area.
> Amit and I are both working hard to remove all these per-partition
> overheads.  I imagine he'd also not be in favour of adding code that
> does something for all partitions when we've pruned down to just 1.
> I've personally no objection to doing the required additional
> processing for the non-pruned partitions only.  We could also then
> fix the case where we disable the optimisation if there's a DEFAULT
> partition without any regard to whether it's been pruned or not.

Those are quite worthwhile enhancements, and being able to avoid a
MergeAppend if the problematic partitions have been pruned would be
great!  I haven't followed all the discussions about the various
optimizations you and Amit are working on very closely, but I don't
think they would be incompatible with a new flag and the possibility
of getting a sorted Append with multi-valued list partitions?

> > Concerning the implementation, there's at least one issue: it
> > assumes that each subpath of a range-partitioned table will be
> > ordered, which is not guaranteed.  You need to generate explicit
> > Sort nodes (in the same order as the query_pathkeys) for partitions
> > that don't have an ordered path and make sure that this path is
> > used in the Append.  Here's a simplistic case showing the issue
> > (sorry, the partition names are poorly chosen):
>
> Thanks for noticing this.  I had been thrown off due to the fact that
> Paths are never actually created for these sorts.  On looking further
> I see that we do checks during createplan to see if the path is
> suitably sorted and just create a sort node if it's not.  This seems
> to go against the whole point of paths, but I'm not going to fight
> for changing it, so I've just done the Append the same way as
> MergeAppend handles it.

Yes, I had quite the same reaction when I saw how MergeAppend handles
it.
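In case it's useful for testing, here's a fresh minimal sketch of the
kind of setup I had in mind (not the exact example from my previous
mail; the names here are made up):

CREATE TABLE r (id int) PARTITION BY RANGE (id);
CREATE TABLE r_1 PARTITION OF r FOR VALUES FROM (0) TO (100000);
CREATE TABLE r_2 PARTITION OF r FOR VALUES FROM (100000) TO (200000);
-- Only the second partition has an index on the ordering column, so
-- for SELECT * FROM r ORDER BY id the planner gets an Index Scan for
-- r_2 but needs an explicit Sort on top of the Seq Scan on r_1 before
-- the ordered Append is valid.
CREATE INDEX ON r_2 (id);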
> > Also, if a LIMIT is specified, it should be able to give better
> > estimates, at least if there's no qual.  For instance:
> >
> > EXPLAIN SELECT * FROM simple ORDER BY id LIMIT 10;
> >                                   QUERY PLAN
> > ----------------------------------------------------------------------------
> >  Limit  (cost=0.00..0.35 rows=10 width=15)
> >    ->  Append  (cost=0.00..7705.56 rows=219999 width=15)
> >          ->  Seq Scan on simple_0_1  (cost=0.00..309.00 rows=20000 width=15)
> >          ->  Index Scan using simple_1_2_id_idx on simple_1_2
> >              (cost=0.29..3148.28 rows=99999 width=14)
> >          ->  Index Scan using simple_2_3_id_idx on simple_2_3
> >              (cost=0.29..3148.29 rows=100000 width=16)
> > (5 rows)
> >
> > In this case, we should estimate that the SeqScan (or in a corrected
> > version the Sort) node should not return more than 10 rows, and each
> > following partition should not be scanned at all, and cost each path
> > accordingly.  I think that this is quite important, for instance to
> > make sure that a natively sorted Append is chosen over a MergeAppend
> > when there are some subpaths with explicit sorts, because with the
> > Append we probably won't have to execute all the sorts if the
> > previous partition scans returned enough rows.
>
> In my patch, I'm not adding any additional paths.  I'm just adding an
> Append instead of a MergeAppend.  For what you're talking about, the
> limit only needs to be passed into any underlying Sort so that it can
> become a top-N sort.  This is handled already in create_limit_path().
> Notice in the plan you pasted above that the limit has a lower total
> cost than its Append subnode.  That's because create_limit_path()
> weighted the Limit total cost based on the row count of the limit and
> its subpath. ... 7705.56 / 219999 * 10 = ~0.35.

Yes.  But the cost of the first partition in this example is wrong,
since there was no additional sort on top of the seq scan.

However, I now realize that, as you said, what your patch does is to
generate an Append *instead* of a MergeAppend if the optimization is
possible.  So there can't be the problem of a MergeAppend being chosen
over a cheaper Append in some cases, sorry for the noise.  I totally
missed that because when I worked on the same topic last year we had
to generate both Append and MergeAppend paths: at that time Append was
not parallel-aware yet, so a parallel MergeAppend could be faster in
some cases.

> > FWIW, both those cases were handled (probably with some bugs
> > though) in the previous patches Ronan and I sent some time ago.
> > Also, I did not forget about this feature; I planned to work on it
> > in the hope of having it included in pg12.  However, I won't have a
> > lot of time to work on it before December.
>
> I apologise for not noticing your patch.  I only went as far as
> checking the November commitfest to see if anything existed already
> and I found nothing there.

No worries, it's more than a year old now (I'm quite ashamed I didn't
come back to this sooner).

> I have time to work on this now, so likely it's better if I continue,
> just in case your time in December does not materialise.

I entirely agree.

> v2 of the patch is attached.  I've not had time yet to give it a
> thorough post-write review, but on first look it seems okay.

I've registered as a reviewer.  I haven't had a deep look at the patch
yet, but thanks a lot for working on it!