On Thu, Jan 30, 2014 at 8:38 PM, Heikki Linnakangas <hlinnakan...@vmware.com> wrote:
> On 01/30/2014 01:53 AM, Tomas Vondra wrote:
>> (3) A file with explain plans for 4 queries suffering ~2x slowdown,
>> and explain plans with 9.4 master and Heikki's patches is available
>> here:
>>
>> http://www.fuzzy.cz/tmp/gin/queries.txt
>>
>> All the queries have 6 common words, and the explain plans look
>> just fine to me - exactly like the plans for other queries.
>>
>> Two things now caught my eye. First, some of these queries actually
>> have words repeated - either exactly, like "database & database", or
>> in negated form, like "!anything & anything". Second, while
>> generating the queries, I use "dumb" frequency, where only exact
>> matches count, i.e. "write != written" etc. But the actual number
>> of hits may be much higher - for example "write" matches exactly
>> just 5% of documents, but using @@ it matches more than 20%.
>>
>> I don't know if that's the actual cause though.
>
> I tried these queries with the data set you posted here:
> http://www.postgresql.org/message-id/52e4141e.60...@fuzzy.cz. The
> reason for the slowdown is the extra consistent calls it causes.
> That's expected - the patch certainly does call consistent in
> situations where it didn't before, and if the "pre-consistent" checks
> are not able to eliminate many tuples, you lose.
>
> So, what can we do to mitigate that? Some ideas:
>
> 1. Implement the catalog changes from Alexander's patch. That ought
> to remove the overhead, as you only need to call the consistent
> function once, not "both ways". OTOH, currently we only call the
> consistent function if there is only one unknown column. If, with the
> catalog changes, we always call the consistent function even if there
> are more unknown columns, we might end up calling it even more often.
>
> 2. Cache the result of the consistent function.
>
> 3. Make the consistent function cheaper. (How? Magic?)
>
> 4. Use some kind of a heuristic, and stop doing the pre-consistent
> checks if they're not effective. Not sure what the heuristic would
> look like.

I'm fiddling around with the following idea for such a heuristic:

1) Sort the entries in ascending order of their estimated number of
TIDs. Let n be the number of entries.

2) Sequentially call the ternary consistent function with the first m
entries set to GIN_FALSE and the remaining n-m set to GIN_MAYBE,
increasing m until it returns GIN_FALSE.

3) Decide whether to use fast-scan by comparing the number of TIDs in
the first m entries against the number of TIDs in the remaining n-m
entries, something like

    number_of_TIDs_in_first_m_entries * k < number_of_TIDs_in_rest_n-m_entries

where k is some quotient. It's a rough method, for sure, but it should
be fast; a sketch in code follows below. Without a native
implementation of the ternary consistent function the algorithm will
be slightly different: only values of m close to n should be checked.

I'm going to implement this heuristic against the latest version of
your patch.
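To make the shape of the heuristic concrete, here is a minimal
standalone C sketch, assuming a simplified setting: GinTernaryValue
mirrors the patch's ternary values, while EntryInfo, the
ternConsistent callback, and the quotient k are hypothetical stand-ins
for the real GIN structures, not the actual implementation.

#include <stdbool.h>
#include <stdlib.h>

typedef enum
{
    GIN_FALSE = 0,
    GIN_TRUE = 1,
    GIN_MAYBE = 2
} GinTernaryValue;

typedef struct
{
    double      estTids;    /* estimated number of TIDs for this entry */
} EntryInfo;

/* step 1: ascending sort by estimated TID count */
static int
cmp_est_tids(const void *a, const void *b)
{
    double      da = ((const EntryInfo *) a)->estTids;
    double      db = ((const EntryInfo *) b)->estTids;

    return (da > db) - (da < db);
}

static bool
use_fast_scan(EntryInfo *entries, int n, double k,
              GinTernaryValue (*ternConsistent) (const GinTernaryValue *, int))
{
    GinTernaryValue *keys = malloc(n * sizeof(GinTernaryValue));
    double      tids_first_m = 0.0;
    double      tids_rest = 0.0;
    int         m = n;          /* if never decisive, fast-scan is useless */
    int         i;

    qsort(entries, n, sizeof(EntryInfo), cmp_est_tids);

    for (i = 0; i < n; i++)
        keys[i] = GIN_MAYBE;

    /*
     * Step 2: grow the GIN_FALSE prefix one entry at a time.  Stop at
     * the first m for which the consistent function returns GIN_FALSE:
     * an item containing none of the first m entries can then never
     * match, so a scan driven by those m posting lists is sufficient.
     */
    for (i = 0; i < n; i++)
    {
        keys[i] = GIN_FALSE;
        if (ternConsistent(keys, n) == GIN_FALSE)
        {
            m = i + 1;
            break;
        }
    }

    /* step 3: compare TID volume of the decisive prefix vs. the rest */
    for (i = 0; i < n; i++)
    {
        if (i < m)
            tids_first_m += entries[i].estTids;
        else
            tids_rest += entries[i].estTids;
    }

    free(keys);
    return tids_first_m * k < tids_rest;
}

In the real executor the TID estimates would presumably come from the
posting list and posting tree sizes, and k would have to be tuned
experimentally.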
------
With best regards,
Alexander Korotkov.