Hi.

First, thanks for looking at this. Apart from GIN indexes and
full-text search being really good in our applications, this also
points to the exact places where they can be improved.

On 2012-02-17 00:15, Tom Lane wrote:
> I looked into the complaint here of poor estimation for GIN indexscans:
> http://archives.postgresql.org/pgsql-performance/2012-02/msg00028.php

I think this is the exact same issue:
http://archives.postgresql.org/pgsql-hackers/2011-11/msg01754.php

> At first glance it sounds like a mistake in selectivity estimation,
> but it isn't: the rowcount estimates are pretty nearly dead on.
> The problem is in the planner's estimate of the cost of executing the
> @@ operator.  We have pg_proc.procost set to 1 for ts_match_vq, but
> actually it's a good deal more expensive than that.  Some
> experimentation suggests that @@ might be about 500 times as expensive
> as a simple integer comparison.  I don't propose pushing its procost
> up that much, but surely at least 10 would be appropriate, maybe even
> 100.

> However ... if you just alter pg_proc.procost in Marc's example, the
> planner *still* picks a seqscan, even though its estimate of the seqscan
> cost surely does go up.  The reason is that its estimate of the GIN
> indexscan cost goes up just as much, since we charge one qual eval cost
> per returned tuple in gincostestimate.  It is easy to tell from the
> actual runtimes that that is not what's happening in a GIN indexscan;
> we are not re-executing the @@ operator for every tuple.  But the
> planner's cost model doesn't know that.

There is something about lossy vs. non-lossy here: if the index result
is lossy, then the executor does "need" to execute the @@ operator on
each tuple, de-toast the toasted tsvector and go all the way.

If it isn't lossy, then at least a count() on a GIN index should be
able to utilize an index-only scan now?
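
As a sketch of the kind of query I mean (the table, column and index
names are just made up for illustration):

  CREATE TABLE docs (id serial PRIMARY KEY, body text, fts tsvector);
  CREATE INDEX docs_fts_idx ON docs USING gin (fts);

  -- the question above: when the GIN result is not lossy, could a count
  -- like this be answered without re-running @@ on every heap tuple?
  EXPLAIN (ANALYZE, BUFFERS)
  SELECT count(*) FROM docs WHERE fts @@ to_tsquery('english', 'foo & bar');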

I've had a significant amount of struggle in this corner over the
years, and the patch that went in for gincostestimate resolved a huge
set of the problems, but not all of them.

Other related threads:
http://archives.postgresql.org/pgsql-performance/2010-05/msg00031.php
(ts_match_vq cost in discussion)
http://archives.postgresql.org/pgsql-performance/2010-05/msg00266.php

I don't think I have ever seen the actual run time of any @@ query be
faster going through the seqscan than going through the index, not
even when it is pulling out nearly all the tuples.

(A test case that tries to exercise that corner:)
http://archives.postgresql.org/pgsql-performance/2009-10/msg00393.php
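
One way to compare the two plans on the same query is to force the
planner to one side and then the other with the standard enable_*
settings (same made-up "docs" table as above, just to show the
mechanics):

  -- time the index/bitmap path
  SET enable_seqscan = off;
  EXPLAIN ANALYZE
  SELECT id FROM docs WHERE fts @@ to_tsquery('english', 'foo');
  RESET enable_seqscan;

  -- time the seqscan path
  SET enable_bitmapscan = off;
  SET enable_indexscan = off;
  EXPLAIN ANALYZE
  SELECT id FROM docs WHERE fts @@ to_tsquery('english', 'foo');
  RESET enable_bitmapscan;
  RESET enable_indexscan;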

And I think it is due to a couple of "real-world" things:
1) The tsvector column is typically toasted.
2) The selected columns are typically in the main table.
3) The GIN index search + pulling the main table is in
    fact a measurably cheaper operation than pulling main+toast,
    uncompressing the toast and applying ts_match_vq, even in the most
   favourable case for the seqscan.

Another real-world thing is that since the tsvector column is in toast
and isn't read when performing a bitmap heap scan, it is almost never
hot in memory either; on top of the decompression cost, that makes the
actual runtime even worse.
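
A quick way to see how much of such a table actually lives in toast
(again the made-up "docs" table; pg_table_size counts the heap plus
toast and the free-space/visibility maps):

  SELECT pg_size_pretty(pg_relation_size('docs'))   AS main_heap,
         pg_size_pretty(pg_table_size('docs')
                        - pg_relation_size('docs')) AS toast_and_maps;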

The same problems hit an index scan on another key when filtering on a
@@ operator, but I think I got around most of them by bumping both the
cost of @@ and the LIMIT in the query, using 10K instead of the 200
rows actually wanted.
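
Roughly what that workaround looks like (the COST value, the query and
the other_key column are only illustrative; ts_match_vq is the function
behind the tsvector @@ tsquery operator):

  -- make @@ look expensive to the planner
  ALTER FUNCTION ts_match_vq(tsvector, tsquery) COST 100;

  -- ask for far more rows than actually wanted, so the small LIMIT does
  -- not make the planner flip to the plan that looks cheap but isn't
  SELECT id
    FROM docs
   WHERE other_key = 42
     AND fts @@ to_tsquery('english', 'foo')
   ORDER BY id
   LIMIT 10000;   -- only ~200 rows are actually wanted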

I do think I have been digging sufficiently in this corner and can
fairly easily test and craft test examples that demonstrate the
challenges (a few are included in the links above).

Thanks for digging into this corner. Let me know if I can help,
although my actual coding skills are sparse (at best).

--
Jesper
