On Jul 14, 2010, at 12:40 PM, Heikki Linnakangas wrote: > On 14/07/10 13:12, PostgreSQL - Hans-Jürgen Schönig wrote: >> hello everybody, >> >> we are currently facing some serious issues with cross correlation issue. >> consider: 10% of all people have breast cancer. we have 2 genders (50:50). >> if i select all the men with breast cancer, i will get basically nobody - >> the planner will overestimate the output. >> this is the commonly known problem ... >> >> this cross correlation problem can be quite nasty in many many cases. >> underestimated nested loops can turn joins into a never ending nightmare and >> so on and so on. >> >> my ideas is the following: >> what if we allow users to specifiy cross-column combinations where we keep >> separate stats? >> maybe somehow like this ... >> >> ALTER TABLE x SET CORRELATION STATISTICS FOR (id = id2 AND id3=id4) >> >> or ... >> >> ALTER TABLE x SET CORRELATION STATISTICS FOR (x.id = y.id AND x.id2 = >> y.id2) >> >> clearly we cannot store correlation for all combinations of all columns so >> we somehow have to limit it. >> >> what is the general feeling about something like that? > > +1 is my general feeling, it's good if you can tell the system to collect > additional statistics where needed. And once you have that, you can write an > agent or something to detect automatically which extra statistics might be > useful. >
it seems i can leave my bunker where i was hiding for cover when i was waiting for a reply ;). yes, my idea was to have an agent as well - but this is just some follow up problem. > However, the problem is how to represent and store the cross-correlation. For > fields with low cardinality, like "gender" and boolean "breast-cancer-or-not" > you can count the prevalence of all the different combinations, but that > doesn't scale. Another often cited example is zip code + street address. > There's clearly a strong correlation between them, but how do you represent > that? we could play the same story with a table storing people including their home country and the color of their skin. obviously we will have more black people in african countries.. > > For scalar values we currently store a histogram. I suppose we could create a > 2D histogram for two columns, but that doesn't actually help with the zip > code + street address problem. > i think we might go for a second relation here specifically for this issue and a boolean flag in the current stats table indicating that additional correlation stats exist (to avoid an additional lookup unless really necessary). do you have a useful syntax in mind? the thing is: this issue can be isolated inside a table (e.g. WHERE a.id = a.id2 AND a.id3 = a.id4) or it might span two tables with an arbitrary number of fields. many thanks, hans -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt, Austria Web: http://www.postgresql-support.de -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers