On 29/05/10 17:34, Tom Lane wrote:
> Jan Urbański <wulc...@wulczer.org> writes:
>> On 29/05/10 17:09, Tom Lane wrote:
>>> There is definitely something wrong with your math there. It's not
>>> possible for the 100'th most common word to have a frequency as high
>>> as 0.06 --- the ones above it presumably have larger frequencies,
>>> which makes the total quite a lot more than 1.0.
>
>> Upf... hahaha, I computed this as 1/(st + 10)*H(W), where it should be
>> 1/((st + 10)*H(W))... So s would be 1/(110*6.5) = 0.0014
>
> Um, apparently I can't do simple arithmetic first thing in the morning
> either, cause I got my number wrong too ;-)
>
> After a bit more research: if you use the basic form of Zipf's law
> with a 1/k distribution, the first frequency has to be about 0.07
> to make the total come out to 1.0 for a reasonable number of words.
> So we could use s = 0.07 / K when we wanted a final list of K words.
> Some people (including the LC paper) prefer a higher exponent, ie
> 1/k^S with S around 1.25. That makes the F1 value around 0.22 which
> seems awfully high for the type of data we're working with, so I think
> the 1/k rule is probably what we want here.
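Both of those figures are easy to check numerically, by the way. Here is a
quick, purely illustrative Python sketch (it assumes a vocabulary of 10^6
words, nothing from the actual patch):

    # Normalising frequencies proportional to 1/k^S so that they sum to 1
    # makes the first (largest) frequency equal to 1/sum(1/k^S).
    # Assumption: a vocabulary of 10^6 words.
    W = 10**6
    for S in (1.0, 1.25):
        total = sum(1.0 / k**S for k in range(1, W + 1))
        print("S = %.2f  F1 = %.4f" % (S, 1.0 / total))
    # prints roughly F1 = 0.0695 for S = 1 and F1 = 0.22 for S = 1.25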
OK, I think we're getting somewhere :o)

I took the formula from Wikipedia's page on Zipf's law, assuming an
exponent of 1:

    freq(K) = 1 / (K * H(W))

where freq(K) is the predicted frequency of the K'th most common word,
H(x) = 1 + 1/2 + 1/3 + ... + 1/x, and W is the number of words in English.

Then I took the nth harmonic number expansion from the page on harmonic
numbers:

    H(n) = ln(n) + 0.5772156649 + 1/(2n) - 1/(12n^2) + 1/(120n^4) + O(n^-6)

Assuming 1 million words in English and treating the terms beyond the
Euler-Mascheroni constant as negligible (they are, for n = 1e6), we get
H(1e6) = 14.3927, which would make the frequency of the K'th word
1/(14.3927 * K), that is 0.06948 / K (let's say 0.07 / K). That brings me
to the same result as yours, which in turn reassures me a lot ;) My
previous result was wrong because I used the wrong logarithm base, go
figure.

So with this, for a statistics target of 100 we would predict the
frequency of the 100th word to be s = 0.07 / 100 = 0.0007.

Assuming N = 154*35017 lexemes in the input, the bucket width and the
final pruning threshold then depend only on the epsilon (e) that we choose
for the LC algorithm. So, if we want e to be equal to s, we'd prune every
1/e = 1/0.0007 = 1428 lexemes and would not discard anything from the
result. If we want e to be s/2, we'd prune every 2857 lexemes and discard
lexemes with final counts below (s - e) * N = 1887. For s/3, s/4 etc. the
numbers look like this:

    e     prune every    discard counts below
    s/1   1428           0
    s/2   2857           1887
    s/3   4285           2516
    s/4   5714           2831
    s/5   7142           3019
    s/6   8571           3145
    s/7   10000          3235
    s/8   11428          3302
    s/9   12857          3355

s/2 or s/3 look reasonable.

So, should I just write a patch that sets the bucket width and the pruning
count using 0.07 as the assumed frequency of the most common word and
epsilon equal to s/2 or s/3?

Cheers,
Jan
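PS: in case it helps, here is the throwaway sketch (Python, purely
illustrative) that reproduces the numbers above. It assumes the same
10^6-word vocabulary and 154*35017 lexeme count as in the text; exact
fractions are used only to avoid floating-point surprises in the floors.

    # Reproduces the table above.  Illustrative only: assumes a 10^6-word
    # vocabulary, s = 0.07 / 100, and N = 154*35017 lexemes in the input.
    import math
    from fractions import Fraction

    W = 10**6
    H_W = math.log(W) + 0.5772156649          # H(W) ~ ln(W) + gamma ~ 14.3927
    print("s ~ %.6f" % (1.0 / (100 * H_W)))   # ~0.000695, call it 0.0007

    s = Fraction(7, 10000)                    # s = 0.07 / 100 = 0.0007
    N = 154 * 35017                           # total lexemes in the input

    for d in range(1, 10):                    # e = s, s/2, ..., s/9
        e = s / d                             # LC error bound epsilon
        bucket_width = int(1 / e)             # prune every 1/e lexemes
        cutoff = int((s - e) * N)             # discard final counts below (s - e)*N
        print("s/%d  %6d  %6d" % (d, bucket_width, cutoff))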