On Mon, Jun 8, 2020 at 10:16 PM Ashutosh Bapat <ashutosh.bapat....@gmail.com> wrote:
> I know one project where they used the PostgreSQL code base to detect
> "robust plans". https://dsl.cds.iisc.ac.in/projects/PICASSO/. Some of
> the papers cited in https://www.vldb.org/pvldb/vldb2010/papers/D01.pdf
> describe the idea.
> In short, the idea is to annotate a plan with a "bandwidth", i.e. how
> the plan fares as the statistics degrade. A plan with a slightly higher
> cost that doesn't degrade much as the statistics degrade is preferred
> over a low-cost plan whose cost rises sharply with degradation of the
> statistics. This is similar to what David is suggesting.
>

Great to know about these, thank you for sharing; the links have been
bookmarked.

> On Fri, Jun 5, 2020 at 12:00 PM Pavel Stehule <pavel.steh...@gmail.com> wrote:
> >
> > On Fri, 5 Jun 2020 at 8:19, David Rowley <dgrowle...@gmail.com> wrote:
> >>
> >> On Mon, 1 Jun 2020 at 01:24, Andy Fan <zhihui.fan1...@gmail.com> wrote:
> >> > The one-line fix describes the exact idea in my mind:
> >> >
> >> > +++ b/src/backend/optimizer/path/costsize.c
> >> > @@ -730,6 +730,13 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count,
> >> >
> >> >     cpu_run_cost += cpu_per_tuple * tuples_fetched;
> >> >
> >> > +   /*
> >> > +    * To make the planner more robust against inaccurate statistics,
> >> > +    * add an extra cost for qpquals so that the fewer qpquals a path
> >> > +    * has, the lower its cost.
> >> > +    */
> >> > +   cpu_run_cost += 0.01 * list_length(qpquals);
> >> > +
> >> >
> >> > This change does fix the issue above, but will it make some other cases
> >> > worse? My answer is no based on my current knowledge, and this is the
> >> > main point on which I want advice. The main impact I can see is that it
> >> > not only changes the cost difference between (a, b) and (a, c), it also
> >> > changes the cost difference between an index scan on (a, c) and a seq
> >> > scan. However, the cost difference between an index scan and a seq scan
> >> > is huge in practice, so I don't think this impact is harmful.
> >>
> >> Didn't that idea already get shot down in the final paragraph of [1]?
> >>
> >> I understand that you wish to increase the cost by some seemingly
> >> innocent constant to fix your problem case. Here are my thoughts
> >> about that: telling lies is not really that easy to pull off. Bad
> >> liars think it's easy and good ones know it's hard. The problem is
> >> that the lies can start small, but then at some point in the future
> >> you must fashion some more lies to account for your initial lie.
> >> Rinse and repeat that a few times and before you know it, your course
> >> is set well away from the point of truth. I feel the part about
> >> "rinse and repeat" applies reasonably well to how join costing works.
> >> The lie is likely to be amplified as the join level gets deeper.
> >>
> >> I think you need to think of a more generic solution and propose that
> >> instead. There are plenty of other quirks in the planner that can
> >> cause suffering due to inaccurate or non-existent statistics. For
> >> example, due to how we multiply individual selectivity estimates,
> >> having a few correlated columns in a join condition can cause the
> >> number of join rows to be underestimated. Subsequent joins can then
> >> end up making bad choices on which join operator to use based on
> >> those inaccurate row estimates.
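
To make the selectivity-multiplication point above concrete, here is a tiny
standalone sketch (plain C, not PostgreSQL code; the table size and the
selectivities are made up) of how the independence assumption underestimates
a pair of perfectly correlated predicates:

    #include <stdio.h>

    int
    main(void)
    {
        double      rows = 1000000.0;   /* rows in the table */
        double      sel_a = 0.01;       /* estimated selectivity of a = 1 */
        double      sel_b = 0.01;       /* estimated selectivity of b = 1 */

        /* Independence assumption: multiply the per-clause selectivities. */
        double      estimated = rows * sel_a * sel_b;   /* 100 rows */

        /*
         * If b is fully determined by a (perfect correlation), the true
         * selectivity of "a = 1 AND b = 1" is just sel_a, i.e. 10000 rows:
         * a 100x underestimate.
         */
        double      actual = rows * sel_a;

        printf("estimated rows: %.0f, actual rows: %.0f\n", estimated, actual);
        return 0;
    }

Every join above such a scan then starts from a row count that is already two
orders of magnitude too low.
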
> >> There's also a problem with WHERE <x> ORDER BY col LIMIT n; sometimes
> >> choosing an index that provides pre-sorted input to the ORDER BY but
> >> cannot use <x> as an indexable condition. We don't record any stats to
> >> make better choices there, maybe we should, but either way, we're
> >> taking a big risk there, as all the rows matching <x> might be right
> >> at the end of the index and we might need to scan the entire thing
> >> before hitting the LIMIT. For now, we just assume a completely even
> >> distribution of rows, i.e. if there are 50 rows estimated in the path
> >> and the limit is for 5 rows, then we'll assume we need to read 10% of
> >> those before finding all the ones we need. In reality, everything
> >> matching <x> might be 95% of the way through the index and we could
> >> end up reading 100% of the rows. That particular problem is not just
> >> caused by the uneven distribution of rows in the index, but also by
> >> selectivity underestimation.
> >>
> >> I'd more recently wondered if we shouldn't have some sort of "risk"
> >> factor in the cost model. I really don't have ideas on how exactly we
> >> would calculate the risk factor in all cases, but in this case, say
> >> the risk factor always starts out as 1. We could multiply that risk
> >> factor by some >1 constant each time we find another index filter
> >> qual. add_path() can prefer lower-risk paths when the costs are
> >> similar. Unsure what the exact add_path() logic would be. Perhaps a
> >> GUC would need to assist with the decision there. Likewise, with
> >> NestedLoopPaths which have a large number of join quals, the risk
> >> factor could go up a bit so that we give stronger consideration to
> >> hash or merge joins instead.
> >>
> >
> > I thought about these ideas too. And I am not alone.
> >
> > https://hal.archives-ouvertes.fr/hal-01316823/document
> >
> > Regards
> >
> > Pavel
> >
> >> Anyway, it's pretty much a large research subject which would take a
> >> lot of work to iron out even just the design. It's likely not a
> >> perfect idea, but I think it has a bit more merit than trying to
> >> introduce lies into the cost model to account for a single case where
> >> there is a problem.
> >>
> >> David
> >>
> >> [1] https://www.postgresql.org/message-id/20200529001602.eu7vuiouuuiclpgb%40development
> >>
>
> --
> Best Wishes,
> Ashutosh Bapat

--
Best Regards
Andy Fan
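
PS: To make the "risk factor" idea David describes above a little more
concrete, below is a very rough standalone sketch. It is not PostgreSQL code:
PathSketch, account_for_filter_quals, prefer_path, RISK_STEP and COST_FUZZ are
invented names, and the 1.1 and 1.05 constants are arbitrary. It only
illustrates "multiply the risk by some constant greater than 1 per index
filter qual, and let the add_path() decision fall back to risk when two costs
are close".

    #include <stdio.h>

    #define RISK_STEP   1.1     /* >1 multiplier per index filter qual */
    #define COST_FUZZ   1.05    /* "costs are similar" threshold */

    typedef struct PathSketch
    {
        double      total_cost;
        double      risk_factor;    /* starts out as 1.0 */
    } PathSketch;

    /* Bump the risk factor once for every qual applied as an index filter. */
    static void
    account_for_filter_quals(PathSketch *path, int num_filter_quals)
    {
        int         i;

        for (i = 0; i < num_filter_quals; i++)
            path->risk_factor *= RISK_STEP;
    }

    /*
     * Toy version of the add_path() decision: keep the clearly cheaper path,
     * but when the costs are within COST_FUZZ of each other, prefer the
     * lower-risk path.
     */
    static const PathSketch *
    prefer_path(const PathSketch *a, const PathSketch *b)
    {
        if (a->total_cost * COST_FUZZ < b->total_cost)
            return a;
        if (b->total_cost * COST_FUZZ < a->total_cost)
            return b;
        return (a->risk_factor <= b->risk_factor) ? a : b;
    }

    int
    main(void)
    {
        /* Two index paths with (nearly) identical estimated costs. */
        PathSketch  idx_ab = {100.0, 1.0};  /* (a, b): both quals are index conditions */
        PathSketch  idx_ac = {100.0, 1.0};  /* (a, c): one qual left as a filter */
        const PathSketch *winner;

        account_for_filter_quals(&idx_ab, 0);
        account_for_filter_quals(&idx_ac, 1);

        winner = prefer_path(&idx_ab, &idx_ac);
        printf("preferred path risk factor: %.2f\n", winner->risk_factor);
        return 0;
    }

In this toy example the two paths cost the same, so the index that can use
both quals as index conditions wins on risk alone; how the real add_path()
should trade cost against risk (perhaps with a GUC assisting, as David says)
is left open.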