On Tue, Nov 22, 2016 at 2:42 PM, Tomas Vondra <tomas.von...@2ndquadrant.com>
wrote:

> On 11/21/2016 11:10 PM, Robert Haas wrote:
>
>> [ reviving an old multivariate statistics thread ]
>>
>> On Thu, Nov 13, 2014 at 6:31 AM, Simon Riggs <si...@2ndquadrant.com>
>> wrote:
>>
>>> On 12 October 2014 23:00, Tomas Vondra <t...@fuzzy.cz> wrote:
>>>
>>> It however seems to be working sufficiently well at this point, enough
>>>> to get some useful feedback. So here we go.
>>>>
>>>
>>> This looks interesting and useful.
>>>
>>> What I'd like to check before a detailed review is that this has
>>> sufficient applicability to be useful.
>>>
>>> My understanding is that Q9 and Q18 of TPC-H have poor plans as a
>>> result of multi-column stats errors.
>>>
>>> Could you look at those queries and confirm that this patch can
>>> produce better plans for them?
>>>
>>
>> Tomas, did you ever do any testing in this area?  One of my
>> colleagues, Rafia Sabih, recently did some testing of TPC-H queries @
>> 20 GB.  Q18 actually doesn't complete at all right now because of an
>> issue with the new simplehash implementation.  I reported it to Andres
>> and he tracked it down, but hasn't posted the patch yet - see
>> http://archives.postgresql.org/message-id/20161115192802.jfbec5s6ougxw...@alap3.anarazel.de
>>
>> Of the remaining queries, the slowest are Q9 and Q20, and both of them
>> have serious estimation errors.  On Q9, things go wrong here:
>>
>>   ->  Merge Join  (cost=5225092.04..6595105.57 rows=154 width=47)
>>                   (actual time=103592.821..149335.010 rows=6503988 loops=1)
>>         Merge Cond: (partsupp.ps_partkey = lineitem.l_partkey)
>>         Join Filter: (lineitem.l_suppkey = partsupp.ps_suppkey)
>>         Rows Removed by Join Filter: 19511964
>>         ->  Index Scan using
>> [snip]
>>
>> Rows Removed by Filter: 756627
>>
>> The estimate for the index scan on partsupp is essentially perfect,
>> and the lineitem-part join is off by about 3x.  However, the merge
>> join is off by about 4000x, which is real bad.
>>
>>
> At this point the patch only deals with statistics on base relations, not
> joins. It's meant to be extended in that direction (the syntax already
> allows for it), but for now that's all there is.
>
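> For illustration, the statistics the patch targets are declared on the
> columns of a single table. A minimal sketch, roughly along these lines
> (the exact syntax in the WIP patch may still change; the statistics name
> and option are illustrative):
>
>     CREATE STATISTICS ps_key_stats (dependencies)
>         ON ps_partkey, ps_suppkey FROM partsupp;
>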
> That being said, this estimate should be improved in 9.6, when you create
> a foreign key between the tables. In fact, that patch was exactly about Q9.
>
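> Concretely, that means adding the two-column foreign key from lineitem to
> partsupp, e.g. (the constraint name is just illustrative):
>
>     ALTER TABLE lineitem
>         ADD CONSTRAINT lineitem_partsupp_fkey
>         FOREIGN KEY (l_partkey, l_suppkey)
>         REFERENCES partsupp (ps_partkey, ps_suppkey);
>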
> This is how the join estimate looks on scale 1 without the FK between the
> two tables:
>
>                           QUERY PLAN
> -----------------------------------------------------------------------
>  Merge Join  (cost=19.19..700980.12 rows=2404 width=261)
>    Merge Cond: ((lineitem.l_partkey = partsupp.ps_partkey) AND
>                 (lineitem.l_suppkey = partsupp.ps_suppkey))
>    ->  Index Scan using idx_lineitem_part_supp on lineitem
>                 (cost=0.43..605856.84 rows=6001117 width=117)
>    ->  Index Scan using partsupp_pkey on partsupp
>                 (cost=0.42..61141.76 rows=800000 width=144)
> (4 rows)
>
>
> and with the foreign key:
>
>                              QUERY PLAN
> -----------------------------------------------------------------------
>  Merge Join  (cost=19.19..700980.12 rows=6001117 width=261)
>              (actual rows=6001215 loops=1)
>    Merge Cond: ((lineitem.l_partkey = partsupp.ps_partkey) AND
>                 (lineitem.l_suppkey = partsupp.ps_suppkey))
>    ->  Index Scan using idx_lineitem_part_supp on lineitem
>                 (cost=0.43..605856.84 rows=6001117 width=117)
>                 (actual rows=6001215 loops=1)
>    ->  Index Scan using partsupp_pkey on partsupp
>                 (cost=0.42..61141.76 rows=800000 width=144)
>                 (actual rows=6001672 loops=1)
>  Planning time: 3.840 ms
>  Execution time: 21987.913 ms
> (6 rows)
>
>
>> On Q20, things go wrong here:
>>
>> [snip]
>>
>> The estimate for the GroupAggregate feeding one side of the merge join
>> is quite accurate.  The estimate for the part-partsupp join on the
>> other side is off by 8x.  Then things get much worse: the estimate for
>> the merge join is off by 400x.
>>
>>
> Well, most of the estimation error comes from the join, but sadly the
> aggregate makes it impossible to use the foreign keys - at least in the
> current version. I don't know whether that can be improved somehow.
>
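> For reference, the problematic shape is roughly this (a simplified
> paraphrase of the de-correlated Q20 subplan, not the exact query):
>
>     SELECT ps_partkey, ps_suppkey
>     FROM partsupp
>     JOIN (SELECT l_partkey, l_suppkey,
>                  0.5 * sum(l_quantity) AS half_qty
>           FROM lineitem
>           GROUP BY l_partkey, l_suppkey) agg
>       ON agg.l_partkey = ps_partkey
>      AND agg.l_suppkey = ps_suppkey
>     WHERE ps_availqty > agg.half_qty;
>
> Because the join input is the aggregate's output rather than lineitem
> itself, the planner can no longer match the join columns to the foreign
> key, and falls back to estimating the two join clauses independently.
>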
>> I'm not really sure whether the multivariate statistics stuff will fix
>> this kind of case or not, but if it did it would be awesome.
>>
> Join statistics are something I'd like to add eventually, but I don't see
> how it could happen in the first version. Also, the patch received no
> reviews this CF, and making it even larger is unlikely to make it more
> attractive.
>

Moved to next CF with "needs review" status.

Regards,
Hari Babu
Fujitsu Australia
