Bug #5766 points out that we're still not there yet in terms of having sane behavior for locale-specific regex operations in Unicode encoding. The reason it's not working is that regc_locale does this to expand the set of characters that are considered to match [[:alnum:]]:
    /*
     * Now compute the character class contents.
     *
     * For the moment, assume that only char codes < 256 can be in these
     * classes.
     */
    ...
    case CC_ALNUM:
        cv = getcvec(v, UCHAR_MAX, 0);
        if (cv)
        {
            for (i = 0; i <= UCHAR_MAX; i++)
            {
                if (pg_wc_isalnum((chr) i))
                    addchr(cv, (chr) i);
            }
        }
        break;

This is a leftover from when we weren't trying to behave sanely for multibyte encodings. Now that we are, it's clearly not good enough. But iterating over many thousands of code points in this loop isn't too appetizing from a performance standpoint.

I looked at the equivalent place in Tcl, and what they're currently doing is keeping a hard-wired list of all the Unicode code points that are classified as alnum, punct, etc. We could duplicate that (and use it only if the encoding is UTF8), but it seems kind of ugly, and it doesn't respect the idea that the locale setting ought to control which characters are considered to be in each class.

Another possibility is to take those lists but apply iswpunct() and friends to the values, including in the finished set only the code points that pass. What you get then is the intersection of the Unicode list and the locale's behavior. (Rough sketches of both ideas follow below my signature.)

Some of the performance pressure could be taken off if we cached the results instead of recomputing them every time a regex uses the character classification; but I'm not sure how much that would save.

Thoughts?

			regards, tom lane
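PS: To make these ideas concrete, here are some rough sketches. Everything below other than iswpunct() itself is invented for illustration; none of it is actual Tcl or Postgres code. First, the Tcl-style hard-wired table would look something like this (the ranges shown are a tiny placeholder subset, not the real Unicode data):

    #include <wchar.h>

    /* One contiguous run of code points belonging to a class. */
    typedef struct
    {
        wint_t      first;
        wint_t      last;
    } cp_range;

    /*
     * Hard-wired list of Unicode punctuation, Tcl-style.  A real table
     * would cover all of the Unicode punctuation ranges, not these
     * three placeholders.
     */
    static const cp_range unicode_punct[] = {
        {0x0021, 0x002F},       /* ! through / */
        {0x003A, 0x0040},       /* : through @ */
        {0x2010, 0x2027},       /* part of the General Punctuation block */
    };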
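The intersection idea is then just a filtering loop over that table, so the locale still gets the final say about membership. (add_to_set() here is a stand-in for the cvec-building machinery in regc_locale, not a real function.)

    #include <wctype.h>

    #define NRANGES(tbl) ((int) (sizeof(tbl) / sizeof((tbl)[0])))

    /*
     * Add every code point that is both in the hard-wired Unicode list
     * and accepted by the locale's iswpunct().
     */
    static void
    build_punct_set(void (*add_to_set) (wint_t))
    {
        int         i;
        wint_t      c;

        for (i = 0; i < NRANGES(unicode_punct); i++)
        {
            for (c = unicode_punct[i].first; c <= unicode_punct[i].last; c++)
            {
                if (iswpunct(c))    /* intersect with locale behavior */
                    add_to_set(c);
            }
        }
    }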
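And the caching idea, in its most naive form, would just memoize the result of the first scan. (Again, all names are invented; a real version would need one cache slot per CC_* class, not just punct, and would have to notice locale changes and flush.)

    #include <stdlib.h>

    static wint_t *punct_chrs = NULL;   /* cached members of [[:punct:]] */
    static int  punct_nchrs = 0;

    /*
     * Return the membership list for [[:punct:]], doing the expensive
     * iswpunct() scan only on first use.
     */
    static const wint_t *
    cached_punct_chrs(int *nchrs)
    {
        if (punct_chrs == NULL)
        {
            int         i,
                        n = 0,
                        max = 0;
            wint_t      c;

            /* worst case: every listed code point passes the filter */
            for (i = 0; i < NRANGES(unicode_punct); i++)
                max += (int) (unicode_punct[i].last - unicode_punct[i].first) + 1;

            punct_chrs = malloc(max * sizeof(wint_t));
            if (punct_chrs == NULL)
                abort();        /* sketch-quality error handling */

            for (i = 0; i < NRANGES(unicode_punct); i++)
            {
                for (c = unicode_punct[i].first; c <= unicode_punct[i].last; c++)
                {
                    if (iswpunct(c))
                        punct_chrs[n++] = c;
                }
            }
            punct_nchrs = n;
        }
        *nchrs = punct_nchrs;
        return punct_chrs;
    }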