Re: Avoid full GIN index scan when possible

Tomas Vondra Mon, 06 Jan 2020 07:23:14 -0800

On Fri, Dec 27, 2019 at 04:36:14AM +0300, Nikita Glukhov wrote:

On 26.12.2019 4:59, Alexander Korotkov wrote:

I've tried to add patch #4 to comparison, but I've catch assertionfailure.
TRAP: FailedAssertion("key->includeNonMatching", File: "ginget.c",Line: 1340)

There simply should be inverted condition in the assertion:
Assert(!key->includeNonMatching);

I have looked at v9 patch, and here is my review:

1. I agree with NULL-flag handling simplifications in ginNewScanKey(),
ginScanKeyAddHiddenEntry() extraction.

2. I also agree that usage of nrequired/nadditional in keyGetItem() is a more
natural solution to implement exclusion keys than my previous attempt of doing
that in scanGetKey().

But there are some questions:

Can we avoid referencing excludeOnly flag keyGetItem() by replacing these
references with !nrequired?

Maybe it would be better to move the whole block of keyGetItem() code
starting from the first loop over required keys and ending before the loop over
additional keys inside 'if (key->nrequired) { ... }'?

Can we avoid introducing excludeOnly flag by reusing searchMode and/or by
moving the initialization of nrequired/nadditional into ginNewScanKey()?


3. The following two times repeated NULL-filtering check looks too complicated
and needs to be refactored somehow:

-       res = key->triConsistentFn(key);
+       if (key->excludeOnly &&
+               key->nuserentries < key->nentries &&
+               key->scanEntry[key->nuserentries]->queryCategory == GIN_CAT_NULL_KEY 
&&
+               key->entryRes[key->nuserentries] == GIN_TRUE)
+               res = GIN_FALSE;
+       else
+               res = key->triConsistentFn(key);

For example, a special consistentFn() can be introduced for such NOT_NULL
scankeys.  Or even a hidden separate one-entry scankey with a trivial
consistentFn() can be added instead of adding hidden entry.


4. forcedRecheck flag that was previously used for discarded empty ALL scankeys
is removed now.  0-entry exclusion keys can appear instead, and their
consistentFn() simply returns constant value.  Could this lead to tangible
overhead in some cases (in comparison to forcedRecheck flag)?


5. A hidden GIN_CAT_EMPTY_QUERY is added only for the first empty ALL-scankey,
NULLs in other columns are filtered out with GIN_CAT_NULL_KEY.  This looks like
asymmetric, and it leads to accelerations is some cases and slowdowns in others
(depending on NULL fractions and their correlations in columns).

The following test shows a significant performance regression of v9:

insert into t select array[i], NULL, NULL from generate_series(1, 1000000) i;

                                      |        Query time, ms
           WHERE condition            | master |   v8   |    v9
---------------------------------------+--------+--------+---------
a @> '{}'                             |    224 |    213 |    212
a @> '{}' and b @> '{}'               |     52 |     57 |    255
a @> '{}' and b @> '{}' and c @> '{}' |     51 |     58 |    290


In the older version of the patch I tried to do the similar things (initialize
only one NOT_NULL entry for the first column), but refused to do this in v8.

So, to avoid slowdowns relative to master, I can offer simply to add
GIN_CAT_EMPTY_QUERY entry for each column with empty ALL-keys if there are
no normal keys.


Yeah, I can confirm those results, although on my system the timings are
a bit different (I haven't tested v8):

                                       |        Query time, ms
            WHERE condition            | master |    v9
---------------------------------------+--------+---------
 a @> '{}'                             |    610 |    589
 a @> '{}' and b @> '{}'               |    185 |    665
 a @> '{}' and b @> '{}' and c @> '{}' |    185 |    741

So that's something we probably need to address, perhaps by using the
GIN_CAT_EMPTY_QUERY entries as proposed.

I've also tested this on a database storing mailing lists archives with
a trigram index, and in that case the performance with short values gets
much better. The "messages" table has two text fields with a GIN trigram
index - subject and body, and querying them with short/long values works
like this:

                   WHERE                    |  master  |    v9
 --------------------------------------------------------------
 subject LIKE '%aa%' AND body LIKE '%xx%'   |    4943  |  4052
 subject LIKE '%aaa%' AND body LIKE '%xx%'  |      10  |    10
 subject LIKE '%aa%' AND body LIKE '%xxx%'  |     380  |    13
 subject LIKE '%aaa%' AND BODY LIKE '%xxx%' |       2  |     2

which seems fairly nice. I've done tests with individual columns, and
that seems to be working fine too.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Avoid full GIN index scan when possible

Reply via email to