On 2010-05-31 20:38, Tom Lane wrote:
> Jesper Krogh writes:
>> Just a small follow up. I tried out the patch (or actually a fresh git
>> checkout) and it now gives very accurate results for both upper and
>> lower end of the MCE-histogram with a lower cutoff that doesn't
>> approach 2.
> Good. How
Jesper Krogh writes:
> Just a small follow up. I tried out the patch (or actually a fresh git
> checkout) and it now gives very accurate results for both upper and
> lower end of the MCE-histogram with a lower cutoff that doesn't
> approach 2.
Good. How much did the ANALYZE time change for your
On 2010-05-30 20:02, Jan Urbański wrote:
> Here's a patch against recent git, but should apply to 8.4 sources as
> well. It would be interesting to measure the memory and time needed to
> analyse the table after applying it, because we will be now using a lot
> bigger bucket size and I haven't done any p
On 31/05/10 00:07, Tom Lane wrote:
> Jan Urbański writes:
> I committed the attached revised version of the patch. Revisions are
> mostly minor but I did make two substantive changes:
>
> * The patch changed the target number of mcelems from 10 *
> statistics_target to just s
Jan Urbański writes:
> Here's a patch against recent git, but should apply to 8.4 sources as
> well. It would be interesting to measure the memory and time needed to
> analyse the table after applying it, because we will be now using a lot
> bigger bucket size and I haven't don
On 30/05/10 09:08, Jesper Krogh wrote:
> On 2010-05-29 15:56, Jan Urbański wrote:
>> On 29/05/10 12:34, Jesper Krogh wrote:
>>> I can "fairly easy" try out patches or do other kind of testing.
>>>
>> I'll try to come up with a patch for you to try and fiddle with these
>> values before Monday
Jan Urbański writes:
>> I think the only relevance of stopwords to the current problem is that
>> *if* stopwords have been removed, we would see a Zipfian distribution
>> with the first few entries removed, and I'm not sure if it's still
>> really Zipfian afterwards.
> That's why
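(As an aside, a sketch of the math behind that worry, not taken from the thread: if the full vocabulary is Zipfian with exponent 1, so that the rank-k frequency is f(k) = 1/(k \cdot H_W) with H_W = \sum_{i=1}^{W} 1/i, and the top m ranks are dropped as stopwords, the surviving word at original rank k keeps a frequency proportional to 1/k but renormalized:

    f'(k) = \frac{1/k}{H_W - H_m}, \qquad k > m

Expressed in its new rank j = k - m this behaves like 1/(j + m), i.e. a shifted Zipf-Mandelbrot curve rather than a pure Zipfian one.)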
> Jesper Krogh writes:
> > On 2010-05-29 15:56, Jan Urbański wrote:
> > > AFAIK statistics for everything other than tsvectors are built based
> > > on the values of whole rows.
>
> > Wouldn't it make sense to treat array types like the tsvectors?
>
> Yeah, I have a personal TODO item to look in
Jesper Krogh writes:
> On 2010-05-29 15:56, Jan Urbański wrote:
>> AFAIK statistics for everything other than tsvectors are built based on
>> the values of whole rows.
> Wouldn't it make sense to treat array types like the tsvectors?
Yeah, I have a personal TODO item to look into that in the fu
On 2010-05-29 15:56, Jan Urbański wrote:
> On 29/05/10 12:34, Jesper Krogh wrote:
>> On 2010-05-28 23:47, Jan Urbański wrote:
>>> On 28/05/10 22:22, Tom Lane wrote:
>>> Now I tried to substitute some numbers there, and so assuming the
>>> English language has ~1e6 words H(W) is around 6.5. Let's assum
Jan Urbański writes:
> [ e of ] s/2 or s/3 look reasonable.
The examples in the LC paper seem to all use e = s/10. Note the stated
assumption e << s.
> So, should I just write a patch that sets the bucket width and pruning
> count using 0.07 as the assumed frequency of the m
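For what it's worth, here is how that derivation works out numerically. This is just my own Python illustration of the numbers being discussed, reading the 0.07 above as the Zipf-predicted frequency of the most common word; the function and parameter names are mine, not anything from ts_typanalyze.c:

    import math

    def lc_parameters(num_mcelem, top_freq=0.07, stopwords=10):
        # Zipfian assumption: the k'th most common word has frequency
        # roughly top_freq / k.  Treat the first `stopwords` ranks as
        # stopwords we will never see in the input, so the least common
        # element we still care about sits at rank num_mcelem + stopwords.
        s = top_freq / (num_mcelem + stopwords)  # support threshold
        e = s / 10.0                             # error bound, e << s
        w = math.ceil(1.0 / e)                   # LC bucket width
        return s, e, w

    print(lc_parameters(100))   # (~0.000636, ~0.0000636, 15715)

A larger target drives s down and the bucket width up, which is where the extra ANALYZE time and memory mentioned elsewhere in the thread would come from.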
On 29/05/10 17:34, Tom Lane wrote:
> Jan Urbański writes:
>> On 29/05/10 17:09, Tom Lane wrote:
>>> There is definitely something wrong with your math there. It's not
>>> possible for the 100'th most common word to have a frequency as high
>>> as 0.06 --- the ones above it pre
Jan Urbański writes:
> On 29/05/10 17:09, Tom Lane wrote:
>> There is definitely something wrong with your math there. It's not
>> possible for the 100'th most common word to have a frequency as high
>> as 0.06 --- the ones above it presumably have larger frequencies,
>> which
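(The arithmetic behind that objection, spelled out: if the 100th most common word really had frequency 0.06, each of the 99 words ranked above it would have frequency of at least 0.06 as well, so the top 100 alone would account for at least $100 \times 0.06 = 6$, i.e. 600% of all word occurrences, which is impossible.)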
On 29/05/10 17:09, Tom Lane wrote:
> Jan Urbański writes:
>> Now I tried to substitute some numbers there, and so assuming the
>> English language has ~1e6 words H(W) is around 6.5. Let's assume the
>> statistics target to be 100.
>
>> I chose s as 1/(st + 10)*H(W) because the
Jan Urbański writes:
> Hm, I am now thinking that maybe this theory is flawed, because tsvectors
> contain only *unique* words, and Zipf's law is talking about words in
> documents in general. Normally a word like "the" would appear lots of
> times in a document, but (even ignor
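A toy example of the distinction (my own illustration, nothing from the thread): Zipf's law is about token frequency, while a tsvector records each lexeme at most once per document, so what ANALYZE effectively samples is closer to document frequency.

    from collections import Counter

    docs = [
        "the cat sat on the mat",
        "the dog ate the bone",
        "a cat and a dog",
    ]

    token_counts = Counter(w for d in docs for w in d.split())
    doc_counts = Counter(w for d in docs for w in set(d.split()))

    print(token_counts["the"] / sum(token_counts.values()))  # 4/16 = 0.25
    print(doc_counts["the"] / len(docs))                     # 2/3  ~ 0.67

For very common words the two diverge a lot: document frequency saturates near 1.0 long before token frequency does.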
Jan Urbański writes:
> Now I tried to substitute some numbers there, and so assuming the
> English language has ~1e6 words H(W) is around 6.5. Let's assume the
> statistics target to be 100.
> I chose s as 1/(st + 10)*H(W) because the top 10 English words will most
> probably
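(A quick check of those numbers, as an aside: with a Zipf exponent of exactly 1, $H_W = \sum_{k=1}^{W} 1/k$, and for $W = 10^6$ that sum is about 14.4, so the rank-$k$ frequency $1/(k \cdot H_W)$ comes out to roughly $0.07/k$, i.e. about $6 \times 10^{-4}$ for $k = 110$. Reading the expression above literally as $(1/(st+10)) \cdot H(W) = 6.5/110 \approx 0.059$ is presumably what produces the impossibly high ~0.06 figure questioned elsewhere in the thread.)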
On 29/05/10 12:34, Jesper Krogh wrote:
> On 2010-05-28 23:47, Jan Urbański wrote:
>> On 28/05/10 22:22, Tom Lane wrote:
>> Now I tried to substitute some numbers there, and so assuming the
>> English language has ~1e6 words H(W) is around 6.5. Let's assume the
>> statistics target to be 100.
>>
>>
On 2010-05-28 04:47, Tom Lane wrote:
> Cranking up the stats target actually makes it worse not better, since
> low-frequency items are then more likely to get into the MCV list
I should have been more precise in the wording. Cranking up the
stats target gave me overall a "better plan", but tha
On 2010-05-28 23:47, Jan Urbański wrote:
> On 28/05/10 22:22, Tom Lane wrote:
>> The idea that I was toying with is to assume a Zipfian distribution of
>> the input (with some reasonable parameter), and use that to estimate
>> what the frequency of the K'th element will be, where K is the target
>> number
On 28/05/10 22:22, Tom Lane wrote:
> The idea that I was toying with is to assume a Zipfian distribution of
> the input (with some reasonable parameter), and use that to estimate
> what the frequency of the K'th element will be, where K is the target
> number of MCV entries or perhaps a bit more.
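A minimal sketch of that idea in Python (my own illustration; the vocabulary size and the exponent-1 assumption are placeholders, not anything the thread settled on):

    # Zipf-based estimate of the frequency of the K'th most common word,
    # assuming exponent 1 and a vocabulary of `vocab_size` distinct words.
    def zipf_kth_frequency(k, vocab_size=1_000_000):
        harmonic = sum(1.0 / i for i in range(1, vocab_size + 1))
        return 1.0 / (k * harmonic)

    print(zipf_kth_frequency(100))   # ~7.0e-04 for the 100th word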
Jan Urbański writes:
> We follow the algorithm as written, the trouble starts when we want to
> output the result. The paper says which items from the D structure
> should be returned when the user asks for items that have frequencies
> higher than a threshold s. What we want t
On 28/05/10 04:47, Tom Lane wrote:
> I re-scanned that paper and realized that there is indeed something
> wrong with the way we are doing it. The paper says (last sentence in
> the definition of the algorithm, section 4.2):
>
> When a user requests a list of items with threshold s, we outp
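For reference, the output step as I read the LC paper, as a rough Python sketch (an illustration only, not the ts_typanalyze.c code; D maps tracked items to their observed counts and N is the number of elements seen so far):

    # Lossy Counting output rule: with error bound e, return the tracked
    # items whose observed count f is at least (s - e) * N for the
    # requested threshold s.
    def lc_output(D, N, s, e):
        return [item for item, f in D.items() if f >= (s - e) * N]

The question in this subthread is what to use for s and e so that this filter keeps roughly the wanted number of most common elements.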
On 28/05/10 04:47, Tom Lane wrote:
> Jan Urbański writes:
>> On 19/05/10 21:01, Jesper Krogh wrote:
>>> In practice, just cranking the statistics estimate up high enough seems
>>> to solve the problem, but doesn't
>>> there seem to be something wrong in how the statistics are c
Jan Urbański writes:
> On 19/05/10 21:01, Jesper Krogh wrote:
>> In practice, just cranking the statistics estimate up high enough seems
>> to solve the problem, but doesn't
>> there seem to be something wrong in how the statistics are collected?
> The algorithm to determine m
On 26/05/2010, at 01.16, Jan Urbański wrote:
> On 19/05/10 21:01, Jesper Krogh wrote:
>> The document base is around 350,000 documents and
>> I have set the statistics target on the tsvector column
>> to 1000 since the 100 seems way off.
> So for tsvectors the statistics target means more or less "at any
On 19/05/10 21:01, Jesper Krogh wrote:
> The document base is around 350,000 documents and
> I have set the statistics target on the tsvector column
> to 1000 since the 100 seems way off.
So for tsvectors the statistics target means more or less "at any time
track at most 10 * statistics_target lexemes simultaneous
Excerpts from Jesper Krogh's message of Wed May 19 15:01:18 -0400 2010:
> But the distribution is very "flat" at the end, the last 128 values are
> exactly 1.00189e-05, which means that any term sitting outside the array
> would get an estimate of 1.00189e-05 * 350174 / 2 = 1.75 ~ 2 rows.
I d
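Spelling out that estimate (my own restatement; the division by two is, as I understand it, the halved-minimum-frequency rule the selectivity code applies to lexemes that are not in the most-common-elements array, and the variable names are mine):

    min_mce_freq = 1.00189e-05   # flat frequency of the last MCELEM entries
    reltuples = 350174           # documents in the table
    estimate = min_mce_freq * reltuples / 2
    print(estimate)              # ~1.75, i.e. about 2 rows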
Hi.
I am working on getting full-text-search to work and have
come across something I think looks a bit strange.
The document base is around 350,000 documents and
I have set the statistics target on the tsvector column
to 1000 since the 100 seems way off.
# ANALYZE verbose reference (document_tsv