Bug #5766 points out that we're still not there yet in terms of having sane behavior for locale-specific regex operations in Unicode encoding. The reason it's not working is that regc_locale does this to expand the set of characters that are considered to match [[:alnum:]]:
    /*
     * Now compute the character class contents.
     *
     * For the moment, assume that only char codes < 256 can be in these
     * classes.
     */
    ...
    case CC_ALNUM:
        cv = getcvec(v, UCHAR_MAX, 0);
        if (cv)
        {
            for (i = 0; i <= UCHAR_MAX; i++)
            {
                if (pg_wc_isalnum((chr) i))
                    addchr(cv, (chr) i);
            }
        }
        break;

This is a leftover from when we weren't trying to behave sanely for multibyte encodings. Now that we are, it's clearly not good enough. But iterating over many thousands of code points in this loop isn't too appetizing from a performance standpoint.

I looked at the equivalent place in Tcl, and what they're currently doing is keeping a hard-wired list of all the Unicode code points that are classified as alnum, punct, etc. We could duplicate that (and use it only if the encoding is UTF8), but it seems kind of ugly, and it doesn't respect the idea that the locale setting ought to control which characters are considered to be in each class.

Another possibility is to take those lists but apply iswpunct() and friends to the values, including in the finished set only the code points that pass. What you get then is the intersection of the Unicode list and the locale's behavior. (Rough sketches of both ideas follow below my signature.)

Some of the performance pressure could be taken off if we cached the results instead of recomputing them every time a regex uses the character classification; but I'm not sure how much that would save.

Thoughts?

			regards, tom lane
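PS: To make these ideas concrete, here are some rough sketches. Everything below other than iswpunct() itself is invented for illustration; none of it is actual Tcl or Postgres code. First, the Tcl-style hard-wired table would look something like this (the ranges shown are a tiny placeholder subset, not the real Unicode data):

    #include <wchar.h>

    /* One contiguous run of code points belonging to a class. */
    typedef struct
    {
        wint_t      first;
        wint_t      last;
    } cp_range;

    /*
     * Hard-wired list of Unicode punctuation, Tcl-style.  A real table
     * would cover all of the Unicode punctuation ranges, not these
     * three placeholders.
     */
    static const cp_range unicode_punct[] = {
        {0x0021, 0x002F},       /* ! through / */
        {0x003A, 0x0040},       /* : through @ */
        {0x2010, 0x2027},       /* part of the General Punctuation block */
    };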
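The intersection idea is then just a filtering loop over that table, so the locale still gets the final say about membership. (add_to_set() here is a stand-in for the cvec-building machinery in regc_locale, not a real function.)

    #include <wctype.h>

    #define NRANGES(tbl) ((int) (sizeof(tbl) / sizeof((tbl)[0])))

    /*
     * Add every code point that is both in the hard-wired Unicode list
     * and accepted by the locale's iswpunct().
     */
    static void
    build_punct_set(void (*add_to_set) (wint_t))
    {
        int         i;
        wint_t      c;

        for (i = 0; i < NRANGES(unicode_punct); i++)
        {
            for (c = unicode_punct[i].first; c <= unicode_punct[i].last; c++)
            {
                if (iswpunct(c))    /* intersect with locale behavior */
                    add_to_set(c);
            }
        }
    }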
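And the caching idea, in its most naive form, would just memoize the result of the first scan. (Again, all names are invented; a real version would need one cache slot per CC_* class, not just punct, and would have to notice locale changes and flush.)

    #include <stdlib.h>

    static wint_t *punct_chrs = NULL;   /* cached members of [[:punct:]] */
    static int  punct_nchrs = 0;

    /*
     * Return the membership list for [[:punct:]], doing the expensive
     * iswpunct() scan only on first use.
     */
    static const wint_t *
    cached_punct_chrs(int *nchrs)
    {
        if (punct_chrs == NULL)
        {
            int         i,
                        n = 0,
                        max = 0;
            wint_t      c;

            /* worst case: every listed code point passes the filter */
            for (i = 0; i < NRANGES(unicode_punct); i++)
                max += (int) (unicode_punct[i].last - unicode_punct[i].first) + 1;

            punct_chrs = malloc(max * sizeof(wint_t));
            if (punct_chrs == NULL)
                abort();        /* sketch-quality error handling */

            for (i = 0; i < NRANGES(unicode_punct); i++)
            {
                for (c = unicode_punct[i].first; c <= unicode_punct[i].last; c++)
                {
                    if (iswpunct(c))
                        punct_chrs[n++] = c;
                }
            }
            punct_nchrs = n;
        }
        *nchrs = punct_nchrs;
        return punct_chrs;
    }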