On Fri, Jun 21, 2013 at 2:40 PM, Erik Rijkers <e...@xs4all.nl> wrote:
> On Fri, June 21, 2013 05:25, Tom Lane wrote: > > "Erik Rijkers" <e...@xs4all.nl> writes: > >> In a 112 MB test table (containing random generated text) with a trgm > index (gin_trgm_ops), I consistently get these > >> timings: > >> select txt from azjunk6 where txt ~ '^abcd'; > >> 130 ms > >> select txt from azjunk6 > >> where txt ~ 'abcd' and substr(txt,1,4) = 'abcd'; > >> 3 ms > > > > Hm, could you provide a self-contained test case? > > > > yes, sorry. I tested on a 1M row table: > > #!/bin/sh > > # create table: > for power in 6; > do > table=azjunk${power} > index=${table}_trgm_re_idx > perl -E' > sub ss{ join"",@_[ map{rand @_} 1 .. shift ] }; > say(ss(80,"a".."g"," ","h".."m"," ","n".."s"," ","t".."z")) for 1 .. > 1e'"${power};" \ > | psql -aqXc " > drop table if exists $table; > create table $table(txt text); > copy $table from stdin;"; > echo "set session maintenance_work_mem='1GB'; > create index $index on $table using gin (txt gin_trgm_ops); > analyze $table;" | psql -qtAX; > done > > # test: > echo " > \\timing on > explain analyze select txt from azjunk6 where txt ~ '^abcd'; -- slow (140 > ms) > explain analyze select txt from azjunk6 where txt ~ 'abcd' and > substr(txt,1,4) = 'abcd'; -- fast (5 ms) > " | psql -Xqa Regex '^abcd' will be expanded into trigrams '__a', '_ab', 'abc' and 'bcd'. However trigrams '__a' is much more frequent than '_ab' which in turn is much more frequent than 'abc' and 'bcd'. Ommiting of ^ leads to ommiting of '__a' and '_ab' and that gives so significant speedup. The problem is that trigram regex engine doesn't know that '__a' and '_ab' is more frequent than another trigrams because we don't know their distribution. However, simple assumption that blank character (in pg_trgm meaning) is much more frequent than others seems to be true for most cases. Attached patch implementing this assumption. It introduces BLANK_COLOR_SIZE macro which is used in selectColorTrigrams like count of characters in COLOR_BLANK. Another change in this patch is split of MAX_TRGM_COUNT into WISH_TRGM_SIZE and MAX_TRGM_SIZE. The idea is to keep trying to remove color trigrams from graph even when it fits into MAX_TRGM_SIZE, because we are intending to scan less posting lists/trees. Comments is not fixed yet, coming soon. ------ With best regards, Alexander Korotkov.
trgm_regex_optimize.1.patch
Description: Binary data
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers