On Mon, Jun 8, 2020 at 10:16 PM Ashutosh Bapat <ashutosh.bapat....@gmail.com> wrote:
> I know one project where they used the PostgreSQL code base to detect
> "robust plans". https://dsl.cds.iisc.ac.in/projects/PICASSO/. Some of
> the papers cited in https://www.vldb.org/pvldb/vldb2010/papers/D01.pdf
> describe the idea.
> In short, the idea is to annotate a plan with a "bandwidth", i.e. how
> the plan fares as the statistics degrade. A plan with a slightly higher
> cost that doesn't degrade much as the statistics degrade is preferred
> over a low-cost plan whose cost rises sharply with degradation of the
> statistics. This is similar to what David is suggesting.
>

Great to know about these, thank you for sharing; the links have been
bookmarked.

> On Fri, Jun 5, 2020 at 12:00 PM Pavel Stehule <pavel.steh...@gmail.com> wrote:
> >
> > On Fri, 5 Jun 2020 at 8:19, David Rowley <dgrowle...@gmail.com> wrote:
> >>
> >> On Mon, 1 Jun 2020 at 01:24, Andy Fan <zhihui.fan1...@gmail.com> wrote:
> >> > The one-line fix describes the exact idea in my mind:
> >> >
> >> > +++ b/src/backend/optimizer/path/costsize.c
> >> > @@ -730,6 +730,13 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count,
> >> >
> >> >     cpu_run_cost += cpu_per_tuple * tuples_fetched;
> >> >
> >> > +   /*
> >> > +    * To make the planner more robust against inaccurate statistics,
> >> > +    * add an extra cost for qpquals so that the fewer qpquals a path
> >> > +    * has, the lower its cost.
> >> > +    */
> >> > +   cpu_run_cost += 0.01 * list_length(qpquals);
> >> > +
> >> >
> >> > This change does fix the issue above, but will it make some other cases
> >> > worse? My answer is no based on my current knowledge, and this is the
> >> > main point on which I want advice. The main impact I can see is that it
> >> > not only changes the cost difference between (a, b) and (a, c), it also
> >> > changes the cost difference between an index scan on (a, c) and a seq
> >> > scan. However, the cost difference between an index scan and a seq scan
> >> > is huge in practice, so I don't think this impact is harmful.
> >>
> >> Didn't that idea already get shot down in the final paragraph of [1]?
> >>
> >> I understand that you wish to increase the cost by some seemingly
> >> innocent constant to fix your problem case. Here are my thoughts
> >> about that: telling lies is not really that easy to pull off. Bad
> >> liars think it's easy and good ones know it's hard. The problem is
> >> that the lies can start small, but then at some point in the future
> >> you must fashion some more lies to account for your initial lie.
> >> Rinse and repeat that a few times and before you know it, your course
> >> is set well away from the point of truth. I feel the part about
> >> "rinse and repeat" applies reasonably well to how join costing works.
> >> The lie is likely to be amplified as the join level gets deeper.
> >>
> >> I think you need to think of a more generic solution and propose that
> >> instead. There are plenty of other quirks in the planner that can
> >> cause suffering due to inaccurate or non-existent statistics. For
> >> example, due to how we multiply individual selectivity estimates,
> >> having a few correlated columns in a join condition can cause the
> >> number of join rows to be underestimated. Subsequent joins can then
> >> end up making bad choices on which join operator to use based on
> >> those inaccurate row estimates.
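
To make the selectivity-multiplication point above concrete, here is a tiny
standalone sketch (plain C, not PostgreSQL code; the table size and the
selectivities are made up) of how the independence assumption underestimates
a pair of perfectly correlated predicates:

    #include <stdio.h>

    int
    main(void)
    {
        double      rows = 1000000.0;   /* rows in the table */
        double      sel_a = 0.01;       /* estimated selectivity of a = 1 */
        double      sel_b = 0.01;       /* estimated selectivity of b = 1 */

        /* Independence assumption: multiply the per-clause selectivities. */
        double      estimated = rows * sel_a * sel_b;   /* 100 rows */

        /*
         * If b is fully determined by a (perfect correlation), the true
         * selectivity of "a = 1 AND b = 1" is just sel_a, i.e. 10000 rows:
         * a 100x underestimate.
         */
        double      actual = rows * sel_a;

        printf("estimated rows: %.0f, actual rows: %.0f\n", estimated, actual);
        return 0;
    }

Every join above such a scan then starts from a row count that is already two
orders of magnitude too low.
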
> >> There's also a problem with WHERE <x> ORDER BY col LIMIT n; sometimes
> >> choosing an index that provides pre-sorted input to the ORDER BY but
> >> cannot use <x> as an indexable condition. We don't record any stats to
> >> make better choices there, maybe we should, but either way, we're
> >> taking a big risk there, as all the rows matching <x> might be right
> >> at the end of the index and we might need to scan the entire thing
> >> before hitting the LIMIT. For now, we just assume a completely even
> >> distribution of rows, i.e. if there are 50 rows estimated in the path
> >> and the limit is for 5 rows, then we'll assume we need to read 10% of
> >> those before finding all the ones we need. In reality, everything
> >> matching <x> might be 95% of the way through the index and we could
> >> end up reading 100% of the rows. That particular problem is not just
> >> caused by the uneven distribution of rows in the index, but also by
> >> selectivity underestimation.
> >>
> >> I'd more recently wondered if we shouldn't have some sort of "risk"
> >> factor in the cost model. I really don't have ideas on how exactly we
> >> would calculate the risk factor in all cases, but in this case, say
> >> the risk factor always starts out as 1. We could multiply that risk
> >> factor by some >1 constant each time we find another index filter
> >> qual. add_path() can prefer lower-risk paths when the costs are
> >> similar. Unsure what the exact add_path() logic would be. Perhaps a
> >> GUC would need to assist with the decision there. Likewise, with
> >> NestedLoopPaths which have a large number of join quals, the risk
> >> factor could go up a bit so that we give stronger consideration to
> >> hash or merge joins instead.
> >>
> >
> > I thought about these ideas too. And I am not alone.
> >
> > https://hal.archives-ouvertes.fr/hal-01316823/document
> >
> > Regards
> >
> > Pavel
> >
> >> Anyway, it's pretty much a large research subject which would take a
> >> lot of work to iron out even just the design. It's likely not a
> >> perfect idea, but I think it has a bit more merit than trying to
> >> introduce lies into the cost model to account for a single case where
> >> there is a problem.
> >>
> >> David
> >>
> >> [1] https://www.postgresql.org/message-id/20200529001602.eu7vuiouuuiclpgb%40development
> >>
>
> --
> Best Wishes,
> Ashutosh Bapat

--
Best Regards
Andy Fan
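
PS: To make the "risk factor" idea David describes above a little more
concrete, below is a very rough standalone sketch. It is not PostgreSQL code:
PathSketch, account_for_filter_quals, prefer_path, RISK_STEP and COST_FUZZ are
invented names, and the 1.1 and 1.05 constants are arbitrary. It only
illustrates "multiply the risk by some constant greater than 1 per index
filter qual, and let the add_path() decision fall back to risk when two costs
are close".

    #include <stdio.h>

    #define RISK_STEP   1.1     /* >1 multiplier per index filter qual */
    #define COST_FUZZ   1.05    /* "costs are similar" threshold */

    typedef struct PathSketch
    {
        double      total_cost;
        double      risk_factor;    /* starts out as 1.0 */
    } PathSketch;

    /* Bump the risk factor once for every qual applied as an index filter. */
    static void
    account_for_filter_quals(PathSketch *path, int num_filter_quals)
    {
        int         i;

        for (i = 0; i < num_filter_quals; i++)
            path->risk_factor *= RISK_STEP;
    }

    /*
     * Toy version of the add_path() decision: keep the clearly cheaper path,
     * but when the costs are within COST_FUZZ of each other, prefer the
     * lower-risk path.
     */
    static const PathSketch *
    prefer_path(const PathSketch *a, const PathSketch *b)
    {
        if (a->total_cost * COST_FUZZ < b->total_cost)
            return a;
        if (b->total_cost * COST_FUZZ < a->total_cost)
            return b;
        return (a->risk_factor <= b->risk_factor) ? a : b;
    }

    int
    main(void)
    {
        /* Two index paths with (nearly) identical estimated costs. */
        PathSketch  idx_ab = {100.0, 1.0};  /* (a, b): both quals are index conditions */
        PathSketch  idx_ac = {100.0, 1.0};  /* (a, c): one qual left as a filter */
        const PathSketch *winner;

        account_for_filter_quals(&idx_ab, 0);
        account_for_filter_quals(&idx_ac, 1);

        winner = prefer_path(&idx_ab, &idx_ac);
        printf("preferred path risk factor: %.2f\n", winner->risk_factor);
        return 0;
    }

In this toy example the two paths cost the same, so the index that can use
both quals as index conditions wins on risk alone; how the real add_path()
should trade cost against risk (perhaps with a GUC assisting, as David says)
is left open.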