Heikki Linnakangas <hlinnakan...@vmware.com> writes: > I finally got around to look at this. I like this new version, without > the path matrix, much better.
I looked through this version too. I have some notes/issues: The biggest problem is that I really don't care for the idea of contrib/pg_trgm being this cozy with the innards of regex_t. Sooner or later we are going to want to split the regex code off again as a standalone library, and we *must* have a cleaner division of labor if that is ever to happen. Not sure what a suitable API would look like though. I think the assumption that all MB characters fit in 4 bytes is unacceptable; someday we'll want to support wider Unicode characters than we do now, and this code seems utterly unable to handle it. It's especially bad that the code isn't even bothering to defend itself against the possibility of wider characters. Can't just modify pg_trgm--1.0.sql in place, must create a "1.1" version and an upgrade script. Comments and documentation still need a lot of copy-editing, also I think a lot of the comment blocks will not survive pg_indent. It'd be a good idea to run trgm_regexp.c through pg_indent as soon as you have that fixed. New file trgm_regexp.c lacks a copyright notice Calling RE_compile_and_cache with DEFAULT_COLLATION_OID is not good enough; need to pass through the actual collation for the regex operator. How deep can the recursion in activateState() go? Is this a practical production approach at all? Do we need a stack depth check there? addKeys is also recursive, same questions. (But on the other hand, the check_stack_depth in scanColorMap seems useless, since its recursion depth is fixed.) Not too happy with convertPgWchar: aside from hard-wired, unchecked assumption about maximum length of pg_wchar2mb_with_len result, why is it that this is doing a lowercase conversion? Surely the regex stuff dealt with that already? I'm suspicious of the code in addKeys() that is special-casing bos[1] and eos[1] but not the other cases (BOL/EOL). My recollection from working on the fixed-prefix stuff is that it's not always obvious which of those states gets used. strnlen() used in fillTrgm() is probably not portable, and is definitely not used anywhere else in Postgres. This isn't a complete review, just some stuff I happened to notice in one quick pass over the code. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers