On 26/05/2010, at 01.16, Jan Urbański <wulc...@wulczer.org> wrote:
On 19/05/10 21:01, Jesper Krogh wrote:
The document base is around 350.000 documents, and I have set the statistics target on the tsvector column to 1000, since 100 seems way off.
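(For reference, a per-column target like that is set roughly as follows; "docs" and "document_tsvector" are placeholder names, not necessarily the real schema:)

    -- Set a larger per-column statistics target and rebuild the statistics.
    -- Table and column names are placeholders.
    ALTER TABLE docs ALTER COLUMN document_tsvector SET STATISTICS 1000;
    ANALYZE docs;

    -- The per-column setting can be checked in pg_attribute
    -- (-1 means "fall back to default_statistics_target").
    SELECT attstattarget
    FROM   pg_attribute
    WHERE  attrelid = 'docs'::regclass
      AND  attname  = 'document_tsvector';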
So for tsvectors the statistics target means more or less "at any time track at most 10 * <target> lexemes simultaneously", where "track" means keeping them in memory while going through the tuples being analysed. Remember that the measure is in lexemes, not whole tsvectors, and the factor of 10 is meant to approximate the average number of unique lexemes in a tsvector. If your documents are very large, this might not be a good approximation.
I just did an avg(length(document_tsvector)), which is 154. Doc count is 1.3M now in my sample set.
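(That number came from something along these lines; length() on a tsvector returns the number of lexemes, and "docs" is again a placeholder table name:)

    -- Average number of lexemes per document, plus the document count.
    SELECT avg(length(document_tsvector)) AS avg_lexemes,
           count(*)                       AS doc_count
    FROM   docs;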
But the distribution is very "flat" at the end; the last 128 values are exactly

1.00189e-05

which means that any term sitting outside the array would get an estimate of

1.00189e-05 * 350174 / 2 = 1.75 ~ 2 rows
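(Spelled out as a query, that estimate is just the smallest stored frequency times the row count, halved:)

    -- Row estimate for a lexeme that falls outside the stored array.
    SELECT 1.00189e-05 * 350174 / 2 AS estimated_rows;   -- ~1.75, i.e. about 2 rows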
Yeah, this might mean that you could try cranking up the stats target a lot, to make the set of simultaneously tracked lexemes larger (it will cost time and memory during analyse, though). If the documents have completely different contents, what can happen is that almost all lexemes are only seen a few times and get removed during the pruning of the working set. I have seen similar behaviour while working on the typanalyze function for tsvectors.
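(Concretely, "cranking it up a lot" would be something like the following; 10000 is the hard upper limit for the target, the names are placeholders, and \timing is just a psql meta-command to see the cost:)

    \timing on
    -- Raise the per-column target to the maximum and rebuild the statistics;
    -- expect ANALYZE to take noticeably longer and use more memory.
    ALTER TABLE docs ALTER COLUMN document_tsvector SET STATISTICS 10000;
    ANALYZE VERBOSE docs;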
I think I would prefer something less "magic". I can increase the statistics target and get more reliable data, but that also increases the number of tuples being picked out for analysis, which is really time consuming.

But that also means that what gets stored as the lower bound of the histogram isn't anywhere near the real lower bound; it is more the lower bound of the "artificial" histogram that remains after the last pruning. I would suggest that the pruning at the end should be aware of this, perhaps by keeping track of the least frequent value that never got pruned and using that as the last pruning and lower bound?
Thanks a lot for the explanation; it fits fairly well with why I couldn't construct a simple test set that had the problem.
So far I have no idea if this is bad or good, so here are a couple of sample runs of stuff that is sitting outside the "most_common_vals" array:

[gathered statistics suck]

So the "most_common_vals" seems to contain a lot of values that should never have been kept in favor of other values that are more common.

In practice, just cranking the statistics estimate up high enough seems to solve the problem, but doesn't there seem to be something wrong in how the statistics are collected?
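(For anyone reading along: what ANALYZE actually stored can be inspected directly. On recent releases the per-lexeme statistics show up in pg_stats as most_common_elems / most_common_elem_freqs; on older releases you have to dig them out of pg_statistic. The table name is a placeholder:)

    -- Inspect the stored lexeme statistics for the tsvector column.
    SELECT most_common_elems,
           most_common_elem_freqs
    FROM   pg_stats
    WHERE  tablename = 'docs'
      AND  attname   = 'document_tsvector';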
The algorithm to determine the most common vals does not do it accurately. That would require keeping all lexemes from the analysed tsvectors in memory, which would be impractical. If you want to learn more about the algorithm being used, try reading http://www.vldb.org/conf/2002/S10P03.pdf and the corresponding comments in ts_typanalyze.c.
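(For intuition, here is a rough PL/pgSQL sketch of the lossy-counting technique from that paper. It is not the ts_typanalyze.c code, which runs in C over a sample during ANALYZE; the docs/document_tsvector names and the bucket width of 10000 are only illustrative stand-ins:)

    -- Working set of tracked lexemes: (lexeme, count, allowed undercount).
    CREATE TEMP TABLE tracked (lexeme text PRIMARY KEY, freq int, delta int);

    DO $$
    DECLARE
        w   CONSTANT int := 10000;  -- bucket width, standing in for "10 * statistics target"
        n   bigint := 0;            -- lexemes seen so far
        b   int := 1;               -- current bucket number
        lex text;
    BEGIN
        FOR lex IN
            -- tsvector_to_array() is available in current releases.
            SELECT unnest(tsvector_to_array(document_tsvector)) FROM docs
        LOOP
            n := n + 1;
            UPDATE tracked SET freq = freq + 1 WHERE lexeme = lex;
            IF NOT FOUND THEN
                INSERT INTO tracked VALUES (lex, 1, b - 1);
            END IF;
            IF n % w = 0 THEN
                -- Bucket boundary: prune entries that cannot be frequent so far.
                DELETE FROM tracked WHERE freq + delta <= b;
                b := b + 1;
            END IF;
        END LOOP;
    END $$;

The part relevant to this thread is the DELETE at each bucket boundary: a lexeme that gets pruned loses its count, and if it shows up again later it starts from scratch, which is how genuinely common lexemes can end up missing from the final list or stored with frequencies that look too low.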
I'll do some reading.
Jesper
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers