On 26/05/2010, at 01.16, Jan Urbański <wulc...@wulczer.org> wrote:
On 19/05/10 21:01, Jesper Krogh wrote:
The document base is around 350.000 documents, and I have set the statistics target on the tsvector column to 1000, since 100 seems way off.
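(For reference, a per-column target like that is set roughly as follows; "docs" and "document_tsvector" are placeholder names, not necessarily the real schema:)

    -- Set a larger per-column statistics target and rebuild the statistics.
    -- Table and column names are placeholders.
    ALTER TABLE docs ALTER COLUMN document_tsvector SET STATISTICS 1000;
    ANALYZE docs;

    -- The per-column setting can be checked in pg_attribute
    -- (-1 means "fall back to default_statistics_target").
    SELECT attstattarget
    FROM   pg_attribute
    WHERE  attrelid = 'docs'::regclass
      AND  attname  = 'document_tsvector';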
So for tsvectors the statistics target means more or less "at any time track at most 10 * <target> lexemes simultaneously", where "track" means keeping them in memory while going through the tuples being analysed. Remember that the measure is in lexemes, not whole tsvectors, and the factor of 10 is meant to approximate the average number of unique lexemes in a tsvector. If your documents are very large, this might not be a good approximation.
I just did an avg(length(document_tsvector)), which is 154. Doc count is 1.3M now in my sample set.
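(That number came from something along these lines; length() on a tsvector returns the number of lexemes, and "docs" is again a placeholder table name:)

    -- Average number of lexemes per document, plus the document count.
    SELECT avg(length(document_tsvector)) AS avg_lexemes,
           count(*)                       AS doc_count
    FROM   docs;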
But the distribution is very "flat" at the end; the last 128 values are exactly

1.00189e-05

which means that any term sitting outside the array would get an estimate of

1.00189e-05 * 350174 / 2 = 1.75 ~ 2 rows
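(Spelled out as a query, that estimate is just the smallest stored frequency times the row count, halved:)

    -- Row estimate for a lexeme that falls outside the stored array.
    SELECT 1.00189e-05 * 350174 / 2 AS estimated_rows;   -- ~1.75, i.e. about 2 rows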
Yeah, this might mean that you could try cranking up the stats target a lot, to make the set of simultaneously tracked lexemes larger (it will cost time and memory during analyse, though). If the documents have completely different contents, what can happen is that almost all lexemes are only seen a few times and get removed during the pruning of the working set. I have seen similar behaviour while working on the typanalyze function for tsvectors.
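(Concretely, "cranking it up a lot" would be something like the following; 10000 is the hard upper limit for the target, the names are placeholders, and \timing is just a psql meta-command to see the cost:)

    \timing on
    -- Raise the per-column target to the maximum and rebuild the statistics;
    -- expect ANALYZE to take noticeably longer and use more memory.
    ALTER TABLE docs ALTER COLUMN document_tsvector SET STATISTICS 10000;
    ANALYZE VERBOSE docs;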
I think I would prefer something less "magic". I can increase the statistics target and get more reliable data, but that also increases the number of tuples being picked out for analysis, which is really time consuming.

But that also means that what gets stored as the lower bound of the histogram isn't anywhere near the real lower bound; it is more the lower bound of the "artificial" histogram that remains after the last pruning. I would suggest that the pruning at the end should be aware of this, perhaps by keeping track of the least frequent value that never got pruned and using that as the last pruning and lower bound?
Thanks a lot for the explanation; it fits fairly well with why I couldn't construct a simple test set that had the problem.
So far I have no idea if this is bad or good, so here are a couple of sample runs of stuff that is sitting outside the "most_common_vals" array:

[gathered statistics suck]

So the "most_common_vals" seems to contain a lot of values that should never have been kept in favor of other values that are more common.

In practice, just cranking the statistics estimate up high enough seems to solve the problem, but doesn't there seem to be something wrong in how the statistics are collected?
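(For anyone reading along: what ANALYZE actually stored can be inspected directly. On recent releases the per-lexeme statistics show up in pg_stats as most_common_elems / most_common_elem_freqs; on older releases you have to dig them out of pg_statistic. The table name is a placeholder:)

    -- Inspect the stored lexeme statistics for the tsvector column.
    SELECT most_common_elems,
           most_common_elem_freqs
    FROM   pg_stats
    WHERE  tablename = 'docs'
      AND  attname   = 'document_tsvector';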
The algorithm to determine the most common vals does not do it accurately. That would require keeping all lexemes from the analysed tsvectors in memory, which would be impractical. If you want to learn more about the algorithm being used, try reading http://www.vldb.org/conf/2002/S10P03.pdf and the corresponding comments in ts_typanalyze.c.
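(For intuition, here is a rough PL/pgSQL sketch of the lossy-counting technique from that paper. It is not the ts_typanalyze.c code, which runs in C over a sample during ANALYZE; the docs/document_tsvector names and the bucket width of 10000 are only illustrative stand-ins:)

    -- Working set of tracked lexemes: (lexeme, count, allowed undercount).
    CREATE TEMP TABLE tracked (lexeme text PRIMARY KEY, freq int, delta int);

    DO $$
    DECLARE
        w   CONSTANT int := 10000;  -- bucket width, standing in for "10 * statistics target"
        n   bigint := 0;            -- lexemes seen so far
        b   int := 1;               -- current bucket number
        lex text;
    BEGIN
        FOR lex IN
            -- tsvector_to_array() is available in current releases.
            SELECT unnest(tsvector_to_array(document_tsvector)) FROM docs
        LOOP
            n := n + 1;
            UPDATE tracked SET freq = freq + 1 WHERE lexeme = lex;
            IF NOT FOUND THEN
                INSERT INTO tracked VALUES (lex, 1, b - 1);
            END IF;
            IF n % w = 0 THEN
                -- Bucket boundary: prune entries that cannot be frequent so far.
                DELETE FROM tracked WHERE freq + delta <= b;
                b := b + 1;
            END IF;
        END LOOP;
    END $$;

The part relevant to this thread is the DELETE at each bucket boundary: a lexeme that gets pruned loses its count, and if it shows up again later it starts from scratch, which is how genuinely common lexemes can end up missing from the final list or stored with frequencies that look too low.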
I'll do some reading.
Jesper
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers