Correlation is the wrong tool. In fact zip codes and city have nearly zero correlation. Zip codes near 00000 are no more likely to be in cities starting with A than Z.

Even if you use an appropriate tool I'm not clear what to do with the information. Consider the case of WHERE city='boston' and zip='02139' and another query with WHERE city='boston' and zip='90210'. One will produce many more records than the separate histograms would predict and the other would produce zero. How do you determine which category a given pair of constants falls into?

Separately you mention cross-table stats - but that' a whole other kettle of worms. I'm not sure which is easier but let's do one at a time?


greg

On 17 Oct 2008, at 12:12 AM, Josh Berkus <[EMAIL PROTECTED]> wrote:


Yes, or to phrase that another way: What kinds of queries are being
poorly optimized now and why?

Well, we have two different correlation problems. One is the problem of dependant correlation, such as the 1.0 correlation of ZIP and CITY fields as a common problem. This could in fact be fixed, I believe, via a linear math calculation based on the sampled level of correlation, assuming we have enough samples. And it's really only an issue if the correlation is
0.5.

The second type of correlation issue we have is correlating values in a parent table with *rows* in child table (i.e. FK joins). Currently, the planner assumes that all rows in the child table are evenly distributed against keys in the parent table. But many real-world databases have this
kind of problem:

A    B
1    10000 rows
2    10000 rows
3    1000 rows
4 .. 1000    0 to 1 rows

For queries which cover values between 4..1000 on A, the misestimate won't be much of a real execution problem. But for values 1,2,3, the query will
bomb.

The other half of this is that bad selectivity estimates only matter
if they're bad enough to change the plan, and I'm not sure whether
cases like this are actually a problem in practice.

My experience is that any estimate which is more than 5x wrong (i.e. < .2
or > 5.0) usually causes problems, and 3x sometimes causes problems.

--
--Josh

Josh Berkus
PostgreSQL
San Francisco

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to