On 29/05/10 17:34, Tom Lane wrote:
> Jan Urbański <wulc...@wulczer.org> writes:
>> On 29/05/10 17:09, Tom Lane wrote:
>>> There is definitely something wrong with your math there. It's not
>>> possible for the 100'th most common word to have a frequency as high
>>> as 0.06 --- the ones above it presumably have larger frequencies,
>>> which makes the total quite a lot more than 1.0.
>
>> Upf... hahaha, I computed this as 1/(st + 10)*H(W), where it should be
>> 1/((st + 10)*H(W))... So s would be 1/(110*6.5) = 0.0014
>
> Um, apparently I can't do simple arithmetic first thing in the morning
> either, cause I got my number wrong too ;-)
>
> After a bit more research: if you use the basic form of Zipf's law
> with a 1/k distribution, the first frequency has to be about 0.07
> to make the total come out to 1.0 for a reasonable number of words.
> So we could use s = 0.07 / K when we wanted a final list of K words.
> Some people (including the LC paper) prefer a higher exponent, ie
> 1/k^S with S around 1.25. That makes the F1 value around 0.22 which
> seems awfully high for the type of data we're working with, so I think
> the 1/k rule is probably what we want here.
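Both of those figures are easy to check numerically, by the way. Here is a
quick, purely illustrative Python sketch (it assumes a vocabulary of 10^6
words, nothing from the actual patch):

    # Normalising frequencies proportional to 1/k^S so that they sum to 1
    # makes the first (largest) frequency equal to 1/sum(1/k^S).
    # Assumption: a vocabulary of 10^6 words.
    W = 10**6
    for S in (1.0, 1.25):
        total = sum(1.0 / k**S for k in range(1, W + 1))
        print("S = %.2f  F1 = %.4f" % (S, 1.0 / total))
    # prints roughly F1 = 0.0695 for S = 1 and F1 = 0.22 for S = 1.25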
OK, I think we're getting somewhere :o)

I took the formula from Wikipedia's page on Zipf's law, assuming an
exponent of 1:

    freq(K) = 1 / (K * H(W))

where freq(K) is the predicted frequency of the K'th most common word,
H(x) = 1 + 1/2 + 1/3 + ... + 1/x, and W is the number of words in English.

Then I took the nth harmonic number expansion from the page on harmonic
numbers:

    H(n) = ln(n) + 0.5772156649 + 1/(2n) - 1/(12n^2) + 1/(120n^4) + O(n^-6)

Assuming 1 million words in English and treating the terms beyond the
Euler-Mascheroni constant as negligible (they are, for n = 1e6), we get
H(1e6) = 14.3927, which would make the frequency of the K'th word
1/(14.3927 * K), that is 0.06948 / K (let's say 0.07 / K). That brings me
to the same result as yours, which in turn reassures me a lot ;) My
previous result was wrong because I used the wrong logarithm base, go
figure.

So with this, for a statistics target of 100 we would predict the
frequency of the 100th word to be s = 0.07 / 100 = 0.0007.

Assuming N = 154*35017 lexemes in the input, the bucket width and the
final pruning threshold then depend only on the epsilon (e) that we choose
for the LC algorithm. So, if we want e to be equal to s, we'd prune every
1/e = 1/0.0007 = 1428 lexemes and would not discard anything from the
result. If we want e to be s/2, we'd prune every 2857 lexemes and discard
lexemes with final counts below (s - e) * N = 1887. For s/3, s/4 etc. the
numbers look like this:

    e     prune every    discard counts below
    s/1   1428           0
    s/2   2857           1887
    s/3   4285           2516
    s/4   5714           2831
    s/5   7142           3019
    s/6   8571           3145
    s/7   10000          3235
    s/8   11428          3302
    s/9   12857          3355

s/2 or s/3 look reasonable.

So, should I just write a patch that sets the bucket width and the pruning
count using 0.07 as the assumed frequency of the most common word and
epsilon equal to s/2 or s/3?

Cheers,
Jan
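PS: in case it helps, here is the throwaway sketch (Python, purely
illustrative) that reproduces the numbers above. It assumes the same
10^6-word vocabulary and 154*35017 lexeme count as in the text; exact
fractions are used only to avoid floating-point surprises in the floors.

    # Reproduces the table above.  Illustrative only: assumes a 10^6-word
    # vocabulary, s = 0.07 / 100, and N = 154*35017 lexemes in the input.
    import math
    from fractions import Fraction

    W = 10**6
    H_W = math.log(W) + 0.5772156649          # H(W) ~ ln(W) + gamma ~ 14.3927
    print("s ~ %.6f" % (1.0 / (100 * H_W)))   # ~0.000695, call it 0.0007

    s = Fraction(7, 10000)                    # s = 0.07 / 100 = 0.0007
    N = 154 * 35017                           # total lexemes in the input

    for d in range(1, 10):                    # e = s, s/2, ..., s/9
        e = s / d                             # LC error bound epsilon
        bucket_width = int(1 / e)             # prune every 1/e lexemes
        cutoff = int((s - e) * N)             # discard final counts below (s - e)*N
        print("s/%d  %6d  %6d" % (d, bucket_width, cutoff))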