On 26/05/2010, at 01.16, Jan Urbański <wulc...@wulczer.org> wrote:

On 19/05/10 21:01, Jesper Krogh wrote:
The document base is around 350,000 documents and
I have set the statistics target on the tsvector column
to 1000, since 100 seems way off.

So for tsvectors the statistics target means more or less "at any time
track at most 10 * <target> lexemes simultaneously" where "track" means
keeping them in memory while going through the tuples being analysed.

Remember that the measure is in lexemes, not whole tsvectors, and the factor of 10 is meant to approximate the average number of unique lexemes in a
tsvector. If your documents are very large, this might not be a good
approximation.

I just did an avg(length(document_tsvector)), which is 154.
The doc count is 1.3M now in my sample set.
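
Doing the arithmetic on that (assuming ANALYZE samples the usual 300 rows
per unit of statistics target; the 154 and the target of 1000 are the
numbers above), the gap between what streams past the tracker and what it
can hold at once is pretty stark:

# Back-of-the-envelope only; 300 rows per target unit is the stock ANALYZE
# sample size, the other numbers are the ones from this thread.
stats_target = 1000
sample_rows = 300 * stats_target            # rows ANALYZE looks at: 300,000
avg_lexemes_per_doc = 154
lexemes_streamed = sample_rows * avg_lexemes_per_doc  # ~46 million lexemes
max_tracked = 10 * stats_target                       # ~10,000 held at once
print(lexemes_streamed, max_tracked)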

But the distribution is very "flat" at the end; the last 128 values are
exactly
1.00189e-05
which means that any term sitting outside the array would get an
estimate of
1.00189e-05 * 350174 / 2 = 1.75 ~ 2 rows
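
In other words the fallback is just "half the smallest stored frequency
times the table size"; as a tiny sketch (the function name is mine, the
real logic lives in the backend's tsquery selectivity estimator):

# Made-up helper mirroring the calculation above: a lexeme outside the
# stored most-common list is assumed to match half the smallest stored
# frequency's worth of rows.
def rows_for_unlisted_lexeme(min_stored_freq, table_rows):
    return min_stored_freq * table_rows / 2

print(rows_for_unlisted_lexeme(1.00189e-05, 350174))  # ~1.75 -> ~2 rows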

Yeah, this might mean that you could try cranking up the stats target a
lot, to make the set of simultaneously tracked lexemes larger (it will
cost time and memory during analyse, though). If the documents have
completely different contents, what can happen is that almost all
lexemes are only seen a few times and get removed during the pruning of
the working set. I have seen similar behaviour while working on the
typanalyze function for tsvectors.
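
For reference, a rough sketch of the lossy-counting loop this tracking
and pruning is based on (simplified Python, not the actual ts_typanalyze.c
code; bucket_width plays the role of 1/epsilon):

def lossy_count(lexeme_stream, bucket_width):
    tracked = {}          # lexeme -> [count, delta]
    seen = 0
    current_bucket = 1
    for lexeme in lexeme_stream:
        seen += 1
        if lexeme in tracked:
            tracked[lexeme][0] += 1
        else:
            # delta bounds how many earlier occurrences we may have missed
            tracked[lexeme] = [1, current_bucket - 1]
        if seen % bucket_width == 0:
            # pruning pass: drop entries that cannot be frequent so far;
            # lexemes seen only once or twice in a diverse document set
            # are exactly the ones that disappear here
            tracked = {lex: cd for lex, cd in tracked.items()
                       if cd[0] + cd[1] > current_bucket}
            current_bucket += 1
    return {lex: cd[0] for lex, cd in tracked.items()}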

I think I would prefer something less "magic". I can increase the statistics target and get more reliable data, but that also increases the number of tuples picked out for analysis, which is really time consuming.

But that also means that what gets stored as the lower bound of the histogram isn't anywhere near the real lower bound; it is more like the lower bound of the "artificial" histogram that was left after the last pruning.

I would suggest that the final pruning should be aware of this, perhaps by keeping track of the least frequent value that never got pruned and using that as the last pruning threshold and lower bound?
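
Roughly what I mean, bolted onto the lossy-counting sketch above (names
are mine, purely illustrative):

# Sketch of the idea: at the final pruning pass, remember the smallest
# count among the entries that actually survive it, and store that as the
# lower bound, instead of the frequency of whatever happened to be added
# after the last prune.
def lower_bound_at_final_prune(tracked, current_bucket):
    survivors = {lex: cd for lex, cd in tracked.items()
                 if cd[0] + cd[1] > current_bucket}   # same test as pruning
    return min(cd[0] for cd in survivors.values()) if survivors else 0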

Thanks a lot for the explanation; it explains fairly well why I couldn't construct a simple test set that showed the problem.


So far I have no idea whether this is bad or good, so here are a couple of
sample runs of stuff that is sitting outside the "most_common_vals" array:

[gathered statistics suck]

So the "most_common_vals" seems to contain a lot of values that should
never have been kept in favor
of other values that are more common.

In practice, just cranking the statistics target up high enough seems
to solve the problem, but doesn't there seem to be something wrong in
how the statistics are collected?

The algorithm to determine most common vals does not do it accurately.
That would require keeping all lexemes from the analysed tsvectors in
memory, which would be impractical. If you want to learn more about the
algorithm being used, try reading
http://www.vldb.org/conf/2002/S10P03.pdf and the corresponding comments in
ts_typanalyze.c.
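
Feeding the lossy_count sketch from earlier in the thread a stream where
almost every lexeme is unique shows the effect quite clearly (illustrative
numbers only):

# Almost all of the one-off lexemes are dropped at the pruning passes;
# only the genuinely frequent lexeme keeps a meaningful count, plus
# whatever happened to arrive after the last prune.
stream = []
for i in range(100000):
    stream.append("lex%d" % i)
    if i % 200 == 0:
        stream.append("common")        # ~500 occurrences, spread evenly
result = lossy_count(stream, bucket_width=1000)
print(len(result), result.get("common"))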

I'll do some reading.

Jesper
