In bug #6457 it's pointed out that we *still* don't have full functionality for locale-dependent regexp behavior with UTF8 encoding. The reason is that there's old crufty code in regc_locale.c that only considers character codes up to 255 when searching for characters that should be considered "letters", "digits", etc.

We could fix that, for some value of "fix", by iterating up to perhaps 0xFFFF when dealing with UTF8 encoding, but the time that would take is unappealing, especially considering that this code is executed afresh any time we compile a regex that requires locale knowledge.
I looked into the upstream Tcl code and observed that they deal with this by having hard-wired tables of which Unicode code points are to be considered letters etc. The tables are directly traceable to the Unicode standard (they provide a script to regenerate them from files available from unicode.org). Nonetheless, I do not find that approach appealing, mainly because we'd be risking deviating from the libc locale code's behavior within regexes when we follow it everywhere else. It seems entirely likely to me that a particular locale setting might consider only some of what Unicode says are letters to be letters.

However, we could possibly compromise by using Unicode-derived tables as a guide to which code points are worth probing libc for. That is, assume that a utf8-based locale will never claim that some code point is a letter that unicode.org doesn't think is a letter. That would cut the number of required probes by a pretty large factor.

The other thing that seems worth doing is to install some caching. We could presumably assume that the behavior of iswupper() et al is fixed for the duration of a database session, so that we only need to run the probe loop once when first asked to create a cvec for a particular category.

Thoughts, better ideas?

			regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers