On Thu, Jan 30, 2014 at 8:38 PM, Heikki Linnakangas <hlinnakan...@vmware.com> wrote:
> On 01/30/2014 01:53 AM, Tomas Vondra wrote:
>> (3) A file with explain plans for 4 queries suffering ~2x slowdown,
>> and explain plans with 9.4 master and Heikki's patches is available
>> here:
>>
>> http://www.fuzzy.cz/tmp/gin/queries.txt
>>
>> All the queries have 6 common words, and the explain plans look
>> just fine to me - exactly like the plans for other queries.
>>
>> Two things now caught my eye. First, some of these queries actually
>> have words repeated - either exactly, like "database & database", or
>> in negated form, like "!anything & anything". Second, while
>> generating the queries, I use "dumb" frequency, where only exact
>> matches count, i.e. "write != written" etc. But the actual number
>> of hits may be much higher - for example "write" matches exactly
>> just 5% of documents, but using @@ it matches more than 20%.
>>
>> I don't know if that's the actual cause though.
>
> I tried these queries with the data set you posted here:
> http://www.postgresql.org/message-id/52e4141e.60...@fuzzy.cz. The
> reason for the slowdown is the extra consistent calls it causes.
> That's expected - the patch certainly does call consistent in
> situations where it didn't before, and if the "pre-consistent" checks
> are not able to eliminate many tuples, you lose.
>
> So, what can we do to mitigate that? Some ideas:
>
> 1. Implement the catalog changes from Alexander's patch. That ought
> to remove the overhead, as you only need to call the consistent
> function once, not "both ways". OTOH, currently we only call the
> consistent function if there is only one unknown column. If, with the
> catalog changes, we always call the consistent function even if there
> are more unknown columns, we might end up calling it even more often.
>
> 2. Cache the result of the consistent function.
>
> 3. Make the consistent function cheaper. (How? Magic?)
>
> 4. Use some kind of a heuristic, and stop doing the pre-consistent
> checks if they're not effective. Not sure what the heuristic would
> look like.

I'm fiddling around with the following idea for such a heuristic:

1) Sort the entries in ascending order of their estimated number of
TIDs. Let n be the number of entries.

2) Sequentially call the ternary consistent function with the first m
entries set to GIN_FALSE and the remaining n-m set to GIN_MAYBE,
increasing m until it returns GIN_FALSE.

3) Decide whether to use fast-scan by comparing the number of TIDs in
the first m entries against the number of TIDs in the remaining n-m
entries, something like

    number_of_TIDs_in_first_m_entries * k < number_of_TIDs_in_rest_n-m_entries

where k is some quotient. It's a rough method, for sure, but it should
be fast; a sketch in code follows below. Without a native
implementation of the ternary consistent function the algorithm will
be slightly different: only values of m close to n should be checked.

I'm going to implement this heuristic against the latest version of
your patch.
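To make the shape of the heuristic concrete, here is a minimal
standalone C sketch, assuming a simplified setting: GinTernaryValue
mirrors the patch's ternary values, while EntryInfo, the
ternConsistent callback, and the quotient k are hypothetical stand-ins
for the real GIN structures, not the actual implementation.

#include <stdbool.h>
#include <stdlib.h>

typedef enum
{
    GIN_FALSE = 0,
    GIN_TRUE = 1,
    GIN_MAYBE = 2
} GinTernaryValue;

typedef struct
{
    double      estTids;    /* estimated number of TIDs for this entry */
} EntryInfo;

/* step 1: ascending sort by estimated TID count */
static int
cmp_est_tids(const void *a, const void *b)
{
    double      da = ((const EntryInfo *) a)->estTids;
    double      db = ((const EntryInfo *) b)->estTids;

    return (da > db) - (da < db);
}

static bool
use_fast_scan(EntryInfo *entries, int n, double k,
              GinTernaryValue (*ternConsistent) (const GinTernaryValue *, int))
{
    GinTernaryValue *keys = malloc(n * sizeof(GinTernaryValue));
    double      tids_first_m = 0.0;
    double      tids_rest = 0.0;
    int         m = n;          /* if never decisive, fast-scan is useless */
    int         i;

    qsort(entries, n, sizeof(EntryInfo), cmp_est_tids);

    for (i = 0; i < n; i++)
        keys[i] = GIN_MAYBE;

    /*
     * Step 2: grow the GIN_FALSE prefix one entry at a time.  Stop at
     * the first m for which the consistent function returns GIN_FALSE:
     * an item containing none of the first m entries can then never
     * match, so a scan driven by those m posting lists is sufficient.
     */
    for (i = 0; i < n; i++)
    {
        keys[i] = GIN_FALSE;
        if (ternConsistent(keys, n) == GIN_FALSE)
        {
            m = i + 1;
            break;
        }
    }

    /* step 3: compare TID volume of the decisive prefix vs. the rest */
    for (i = 0; i < n; i++)
    {
        if (i < m)
            tids_first_m += entries[i].estTids;
        else
            tids_rest += entries[i].estTids;
    }

    free(keys);
    return tids_first_m * k < tids_rest;
}

In the real executor the TID estimates would presumably come from the
posting list and posting tree sizes, and k would have to be tuned
experimentally.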
------
With best regards,
Alexander Korotkov.