Re: [HACKERS] Cross-column statistics revisited

Greg Stark Thu, 16 Oct 2008 15:21:31 -0700

Correlation is the wrong tool. In fact zip codes and city have nearly zero correlation. Zip codes near 00000 are no more likely to be in cities starting with A than Z.

Even if you use an appropriate tool I'm not clear what to do with the information. Consider the case of WHERE city='boston' and zip='02139' and another query with WHERE city='boston' and zip='90210'. One will produce many more records than the separate histograms would predict and the other would produce zero. How do you determine which category a given pair of constants falls into?
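To make the two cases concrete, here is a minimal sketch of the independence assumption applied to per-column frequencies. The sample rows, the counts, and the helper names `independent_estimate` and `actual_rows` are all hypothetical and chosen only for illustration; this is not how the planner is implemented.

```python
from collections import Counter

# Hypothetical sample of (city, zip) rows; the counts are invented for illustration.
rows = (
    [("boston", "02139")] * 40
    + [("boston", "02134")] * 60
    + [("beverly hills", "90210")] * 100
)

n = len(rows)
city_counts = Counter(city for city, _ in rows)
zip_counts = Counter(zip_code for _, zip_code in rows)
pair_counts = Counter(rows)


def independent_estimate(city, zip_code):
    """Row estimate when the two predicates are treated as independent."""
    city_selectivity = city_counts[city] / n
    zip_selectivity = zip_counts[zip_code] / n
    return n * city_selectivity * zip_selectivity


def actual_rows(city, zip_code):
    """Rows in the sample that really match both predicates."""
    return pair_counts[(city, zip_code)]


for city, zip_code in [("boston", "02139"), ("boston", "90210")]:
    print(city, zip_code,
          "estimated:", round(independent_estimate(city, zip_code), 1),
          "actual:", actual_rows(city, zip_code))
```

With this made-up sample the first pair is underestimated (about 20 rows estimated against 40 actual) and the second is overestimated (about 50 estimated against 0 actual). The per-column counts alone give no way to tell which of the two cases a given pair of constants falls into.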

Separately you mention cross-table stats - but that's a whole other kettle of worms. I'm not sure which is easier but let's do one at a time?



greg

On 17 Oct 2008, at 12:12 AM, Josh Berkus <[EMAIL PROTECTED]> wrote:

Yes, or to phrase that another way: What kinds of queries are being
poorly optimized now and why?
Well, we have two different correlation problems. One is the problem of dependent correlation, such as the 1.0 correlation of ZIP and CITY fields as a common problem. This could in fact be fixed, I believe, via a linear math calculation based on the sampled level of correlation, assuming we have enough samples. And it's really only an issue if the correlation is 0.5.
The second type of correlation issue we have is correlating values in a parent table with *rows* in child table (i.e. FK joins). Currently, the planner assumes that all rows in the child table are evenly distributed against keys in the parent table. But many real-world databases have this kind of problem:

A    B
1    10000 rows
2    10000 rows
3    1000 rows
4 .. 1000    0 to 1 rows
For queries which cover values between 4..1000 on A, the misestimate won't be much of a real execution problem. But for values 1,2,3, the query will bomb.
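As a rough sketch of the arithmetic behind that table: if child rows are assumed to be spread evenly over the parent keys, the per-key estimate is simply total rows divided by the number of keys. The code below assumes every key in 4..1000 has exactly 1 row (the table says "0 to 1"), and `misestimate_factor` is a hypothetical helper name, not anything from the planner.

```python
# Child-table rows per parent key, following the table above.
# Assumption: each key in 4..1000 has exactly 1 row (the table says "0 to 1").
rows_per_key = {1: 10000, 2: 10000, 3: 1000}
rows_per_key.update({key: 1 for key in range(4, 1001)})

total_rows = sum(rows_per_key.values())
n_keys = len(rows_per_key)

# Even-distribution assumption: every key is expected to match the same number of rows.
uniform_estimate = total_rows / n_keys


def misestimate_factor(key):
    """Actual row count for a key divided by the even-distribution estimate."""
    return rows_per_key[key] / uniform_estimate


for key in (1, 3, 4):
    print("key", key,
          "actual:", rows_per_key[key],
          "estimated:", round(uniform_estimate, 1),
          "factor:", round(misestimate_factor(key), 2))
```

Under these assumptions the estimate is about 22 rows per key, so keys 1 and 2 are underestimated by a factor of several hundred, key 3 by a factor of roughly 45, and the keys in 4..1000 are overestimated by a factor of about 22.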
The other half of this is that bad selectivity estimates only matter
if they're bad enough to change the plan, and I'm not sure whether
cases like this are actually a problem in practice.
My experience is that any estimate which is more than 5x wrong (i.e. < .2 or > 5.0) usually causes problems, and 3x sometimes causes problems.

--
--Josh

Josh Berkus
PostgreSQL
San Francisco

--
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

