On 22.11.2011 21:38, Alexander Korotkov wrote:
WIP patch with index support for regexp search for pg_trgm contrib is attached. In spite of techniques which extracts continuous text parts from regexp, this patch presents technique of automatum transformation. That allows more comprehensive trigrams extraction.
Nice!
Current version of patch have some limitations: 1) Algorithm of logical expression extraction on trigrams have high computational complexity. So, it can become really slow on regexp with many branches. Probably, improvements of this algorithm is possible. 2) Surely, no perfomance benefit if no trigrams can be extracted from regexp. It's inevitably. 3) Currently, only GIN index is supported. There are no serious problems, GiST code for it just not written yet. 4) It appear to be some kind of problem to extract multibyte encoded character from pg_wchar. I've posted question about it here: http://archives.postgresql.org/pgsql-hackers/2011-11/msg01222.php While I've hardcoded some dirty solution. So PG_EUC_JP, PG_EUC_CN, PG_EUC_KR, PG_EUC_TW, PG_EUC_JIS_2004 are not supported yet.
This is pretty far from being in committable state, so I'm going to mark this as "returned with feedback" in the commitfest app. The feedback:
The code badly needs comments. There is no explanation of how the trigram extraction code in trgm_regexp.c works. Guessing from the variable names, it seems to be some sort of a coloring algorithm that works on a graph, but that all needs to be explained. Can this algorithm be found somewhere in literature, perhaps? A link to a paper would be nice.
Apart from that, the multibyte issue seems like the big one. Any way around that?
-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers