On Mon, Oct 29, 2018 at 1:44 AM David Rowley
<david.row...@2ndquadrant.com> wrote:
>
> On 28 October 2018 at 03:49, Julien Rouhaud <rjuju...@gmail.com> wrote:
> > I just had a look at your patch.  I see that you implemented only a
> > subset of the possible optimizations (only the case for range
> > partitioning without subpartitions).  This has been previously
> > discussed, but we should be able to do a similar optimization for
> > list partitioning if there are no interleaved values, and also for
> > some cases of multi-level partitioning.
>
> I had thought about these cases but originally had thought they would
> be more complex to implement than I could justify.  On review, I've
> found some pretty cheap ways to handle both sub-partitions and LIST
> partitioned tables.  Currently, with LIST partitioned tables I've
> coded it to only allow the optimisation if there's no DEFAULT
> partition and all partitions are defined with exactly 1 Datum.  This
> guarantees that there are no interleaved values, but it'll just fail
> to optimise cases like FOR VALUES IN (1, 2) + FOR VALUES IN (3, 4).
> The reason that I didn't go to the trouble of the additional checks
> was that I don't really want to add any per-partition overhead to
> this.

I see, but the overhead you mention is because you're doing that check
during planning, in build_partition_pathkeys().  As advised by Robert
quite some time ago
(https://www.postgresql.org/message-id/CA+TgmobOWgT1=zyjx-q=7s8akxnodix46qg0_-yx7k369p6...@mail.gmail.com),
we can store that information when the PartitionDesc is built, so it
wouldn't be problematic.  Since checking for overlapping values is
straightforward with the BoundInfoData infrastructure, it'd be a pity
to miss this optimization in such cases, which I believe would not be
rare.
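To make sure we're talking about the same kind of case, here's a
minimal sketch (the table and partition names are just made up for
this example):

-- Multi-valued but non-interleaved list partitions: every value of
-- t_12 sorts before every value of t_34, so an ordered Append on (a)
-- would still be valid even though each partition holds more than one
-- Datum.  The current single-Datum check would miss this case.
CREATE TABLE t (a int) PARTITION BY LIST (a);
CREATE TABLE t_12 PARTITION OF t FOR VALUES IN (1, 2);
CREATE TABLE t_34 PARTITION OF t FOR VALUES IN (3, 4);

-- Interleaved values: the two partitions alternate in sort order, so
-- this one has to stay excluded from the optimization.
CREATE TABLE t2 (a int) PARTITION BY LIST (a);
CREATE TABLE t2_13 PARTITION OF t2 FOR VALUES IN (1, 3);
CREATE TABLE t2_24 PARTITION OF t2 FOR VALUES IN (2, 4);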
> If RelOptInfo had a Bitmapset of live partitions then we could just
> check the partitions that survived pruning.  Amit Langote has a
> pending patch which does that and some other useful stuff, so maybe
> we can delay fixing that until the dust settles a bit in that area.
> Amit and I are both working hard to remove all these per-partition
> overheads.  I imagine he'd also not be in favour of adding code that
> does something for all partitions when we've pruned down to just 1.
> I've personally no objection to doing the required additional
> processing for the non-pruned partitions only.  We could also then
> fix the case where we disable the optimisation if there's a DEFAULT
> partition without any regard to whether it's been pruned or not.

Those are quite worthwhile enhancements, and being able to avoid a
MergeAppend if the problematic partitions have been pruned would be
great!  I haven't followed all the discussions about the various
optimizations you and Amit are working on very closely, but I don't
think they would be incompatible with a new flag and the possibility
of getting a sorted Append with multi-valued list partitions?

> > Concerning the implementation, there's at least one issue: it
> > assumes that each subpath of a range-partitioned table will be
> > ordered, which is not guaranteed.  You need to generate explicit
> > Sort nodes (in the same order as the query_pathkeys) for partitions
> > that don't have an ordered path and make sure that this path is
> > used in the Append.  Here's a simplistic case showing the issue
> > (sorry, the partition names are poorly chosen):
>
> Thanks for noticing this.  I had been thrown off due to the fact that
> Paths are never actually created for these sorts.  On looking further
> I see that we do checks during createplan to see if the path is
> suitably sorted and just create a sort node if it's not.  This seems
> to go against the whole point of paths, but I'm not going to fight
> for changing it, so I've just done the Append the same way as
> MergeAppend handles it.

Yes, I had quite the same reaction when I saw how MergeAppend handles
it.
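In case it's useful for testing, here's a fresh minimal sketch of the
kind of setup I had in mind (not the exact example from my previous
mail; the names here are made up):

CREATE TABLE r (id int) PARTITION BY RANGE (id);
CREATE TABLE r_1 PARTITION OF r FOR VALUES FROM (0) TO (100000);
CREATE TABLE r_2 PARTITION OF r FOR VALUES FROM (100000) TO (200000);
-- Only the second partition has an index on the ordering column, so
-- for SELECT * FROM r ORDER BY id the planner gets an Index Scan for
-- r_2 but needs an explicit Sort on top of the Seq Scan on r_1 before
-- the ordered Append is valid.
CREATE INDEX ON r_2 (id);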
> > Also, if a LIMIT is specified, it should be able to give better
> > estimates, at least if there's no qual.  For instance:
> >
> > EXPLAIN SELECT * FROM simple ORDER BY id LIMIT 10;
> >                                   QUERY PLAN
> > ----------------------------------------------------------------------------
> >  Limit  (cost=0.00..0.35 rows=10 width=15)
> >    ->  Append  (cost=0.00..7705.56 rows=219999 width=15)
> >          ->  Seq Scan on simple_0_1  (cost=0.00..309.00 rows=20000 width=15)
> >          ->  Index Scan using simple_1_2_id_idx on simple_1_2
> >              (cost=0.29..3148.28 rows=99999 width=14)
> >          ->  Index Scan using simple_2_3_id_idx on simple_2_3
> >              (cost=0.29..3148.29 rows=100000 width=16)
> > (5 rows)
> >
> > In this case, we should estimate that the SeqScan (or in a corrected
> > version the Sort) node should not return more than 10 rows, and each
> > following partition should not be scanned at all, and cost each path
> > accordingly.  I think that this is quite important, for instance to
> > make sure that a natively sorted Append is chosen over a MergeAppend
> > when there are some subpaths with explicit sorts, because with the
> > Append we probably won't have to execute all the sorts if the
> > previous partition scans returned enough rows.
>
> In my patch, I'm not adding any additional paths.  I'm just adding an
> Append instead of a MergeAppend.  For what you're talking about, the
> limit only needs to be passed into any underlying Sort so that it can
> become a top-N sort.  This is handled already in create_limit_path().
> Notice in the plan you pasted above that the limit has a lower total
> cost than its Append subnode.  That's because create_limit_path()
> weighted the Limit total cost based on the row count of the limit and
> its subpath. ... 7705.56 / 219999 * 10 = ~0.35.

Yes.  But the cost of the first partition in this example is wrong,
since there was no additional sort on top of the seq scan.

However, I now realize that, as you said, what your patch does is to
generate an Append *instead* of a MergeAppend if the optimization is
possible.  So there can't be the problem of a MergeAppend being chosen
over a cheaper Append in some cases, sorry for the noise.  I totally
missed that because when I worked on the same topic last year we had
to generate both Append and MergeAppend paths: at that time Append was
not parallel-aware yet, so a parallel MergeAppend could be faster in
some cases.

> > FWIW, both those cases were handled (probably with some bugs
> > though) in the previous patches Ronan and I sent some time ago.
> > Also, I did not forget about this feature; I planned to work on it
> > in the hope of having it included in pg12.  However, I won't have a
> > lot of time to work on it before December.
>
> I apologise for not noticing your patch.  I only went as far as
> checking the November commitfest to see if anything existed already
> and I found nothing there.

No worries, it's more than a year old now (I'm quite ashamed I didn't
come back to this sooner).

> I have time to work on this now, so likely it's better if I continue,
> just in case your time in December does not materialise.

I entirely agree.

> v2 of the patch is attached.  I've not had time yet to give it a
> thorough post-write review, but on first look it seems okay.

I've registered as a reviewer.  I haven't had a deep look at the patch
yet, but thanks a lot for working on it!