On Dec 21, 2010, at 15:51 , t...@fuzzy.cz wrote:
>>> This is the reason why they choose to always combine the values (with
>>> varying weights).
>>
>> There are no varying weights involved there. What they do is to express
>> P(A=x,B=y) once as
>>
>> ...
>>
>> P(A=x,B=y) ~= P(B=y|A=x)*P(A=x)/2 + P(A=x|B=y)*P(B=y)/2
>>            = dist(A)*P(A=x)/(2*dist(A,B)) + dist(B)*P(B=y)/(2*dist(A,B))
>>            = (dist(A)*P(A=x) + dist(B)*P(B=y)) / (2*dist(A,B))
>>
>> That averaging step adds *no* further data-dependent weights.
>
> Sorry, by 'varying weights' I didn't mean that the weights are different
> for each value of A or B. What I meant is that they combine the values
> with different weights (just as you explained).
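[To make the algebra in the quoted derivation concrete, here is a small numeric sketch; all dist() counts and per-value frequencies below are made-up illustration values, not statistics from the thread.]

    # Illustrative numbers only: 10 distinct values of A, 20 of B,
    # 50 distinct (A,B) pairs, and invented per-value frequencies.
    dist_a, dist_b, dist_ab = 10, 20, 50
    p_a = 0.10  # P(A=x), e.g. taken as 1/dist(A)
    p_b = 0.02  # P(B=y), e.g. taken from a histogram/MCV list

    # The two single-direction estimates that get averaged:
    est_via_a = dist_a * p_a / dist_ab  # P(B=y|A=x)*P(A=x)
    est_via_b = dist_b * p_b / dist_ab  # P(A=x|B=y)*P(B=y)

    # Plain average of the two estimates (weight 1/2 each) ...
    combined = (est_via_a + est_via_b) / 2
    # ... which is exactly the closed form from the quoted derivation:
    closed_form = (dist_a * p_a + dist_b * p_b) / (2 * dist_ab)
    assert abs(combined - closed_form) < 1e-12

    # Note it is not a weighted average of p_a and p_b themselves:
    # dist(A) + dist(B) = 30, while 2*dist(A,B) = 100.
    print(combined)  # 0.014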
I'm still not sure we're on the same page here. The resulting formula is
*not* a weighted average of P(A=x) and P(B=y), since in general
dist(A) + dist(B) = 2*dist(A,B) does *not* hold. It may look like one
syntactically, but that's about it.

The resulting formula instead is an *unweighted* (equal-weights) average of
the two estimates P(B=y|A=x)*P(A=x) and P(A=x|B=y)*P(B=y).

You might just as well estimate P(A=x,B=y) with

  P(B=y|A=x)*P(A=x) = dist(A)*P(A=x)/dist(A,B)

and it'd *still* be the very same uniform bayesian approach, just no longer
symmetric in A and B. Which may easily be preferable if you have reason to
believe that this estimate is more accurate than the one obtained by
swapping A and B. The original paper doesn't deal with that case, simply
because it doesn't say how P(A=x) and P(B=y) are obtained at all. The
postgres estimator, on the other hand, knows quite well how it derived
P(A=x) and P(B=y), and may have much higher confidence in one value than
in the other.

Assume for example that you're preparing the statement

  SELECT * FROM T WHERE A = ? AND B = 1

We'll then estimate P(A=?) as 1/dist(A), since we cannot do any better
without an actual value for the parameter "?". The estimate for P(B=1), on
the other hand, can use the histogram, and will thus very likely be much
more precise. The two estimates for P(A=?,B=1) in this case are

  P(A=?|B=1)*P(B=1) = dist(B)*P(B=1)/dist(A,B), and
  P(B=1|A=?)*P(A=?) = dist(A)*P(A=?)/dist(A,B).

There's a good chance that the former beats the latter, and thus also their
average.

best regards,
Florian Pflug
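[As a rough numeric sketch of the prepared-statement case above; the dist() counts and the histogram-derived P(B=1) are invented for illustration, not taken from any real table.]

    # Invented statistics for: SELECT * FROM T WHERE A = ? AND B = 1
    dist_a, dist_b, dist_ab = 100, 10, 500
    p_a = 1.0 / dist_a  # P(A=?): parameter value unknown, fall back to 1/dist(A)
    p_b = 0.30          # P(B=1): pretend the histogram/MCV list gives 30%

    # The two single-direction estimates of P(A=?, B=1):
    est_from_b = dist_b * p_b / dist_ab  # P(A=?|B=1)*P(B=1), built on the histogram
    est_from_a = dist_a * p_a / dist_ab  # P(B=1|A=?)*P(A=?), built on the 1/dist(A) guess

    average = (est_from_a + est_from_b) / 2
    print(est_from_b, est_from_a, average)  # 0.006 0.002 0.004
    # If the histogram is trustworthy, the estimate built on P(B=1) is the
    # one worth preferring; averaging pulls it halfway toward the cruder guess.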