On Sun, 8 Sep 2002, Joel Mawhorter wrote:

> Wouldn't it make more sense to use UTF-16 than UTF-8 in regular expressions?
> At least with UTF-16, in most cases, 1 character == 1 symbol so regular
> expressions would be more manageable (e.g. what does a dot mean in a regular
> expression when being matched against symbols that can be represented in 1, 2
> or 3 chars?). Does ICU have regular expression support? I know the regular
> expression support in Java 1.4 is very nice and uses UTF-16 but alas we can't
> really use that in Sword unless we come up with a CNNI (C non-native
> interface :-).

Nope. Sword is entirely UTF-8 internally. Perl just happens to be the same. Perl has a nice regex implementation built on UTF-8. In Perl, a dot means a character. Regexes should operate on characters, not bytes, after all.

No, ICU doesn't have any regex support. It's almost entirely devoted to i18n/l10n stuff, though it does have a simple io library. Using code that works with UTF-8 also benefits us by not requiring that we convert to/from UTF-16.

--Chris
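
To make the "a dot means a character" point concrete, here is a minimal sketch. It is in Python rather than Perl and is not Sword code; the sample strings and variable names are made up for illustration. It shows a dot matching one character on decoded text and one byte on raw UTF-8, and it shows that UTF-16 is not strictly one unit per character either.

```python
import re

# Hypothetical sample text: the Greek word "logos", written with escapes
# so the code points are unambiguous. Each letter takes 2 bytes in UTF-8.
text = "\u03bb\u03cc\u03b3\u03bf\u03c2"
utf8 = text.encode("utf-8")

# On decoded text, "." consumes one character (code point) per match.
char_matches = re.findall(r".", text)

# On raw UTF-8 bytes, "." consumes one byte per match, so every
# two-byte letter is split in half.
byte_matches = re.findall(rb".", utf8)

# UTF-16 has the same kind of problem for characters outside the BMP:
# they need a surrogate pair, i.e. two 16-bit units.
gothic = "\U00010330"  # GOTHIC LETTER AHSA
utf16_units = len(gothic.encode("utf-16-le")) // 2

print(len(char_matches), len(byte_matches), utf16_units)
```

So a character-aware regex engine gives the expected meaning for a dot whichever encoding sits underneath, while a unit-oriented one can split characters in UTF-8 and in UTF-16 alike.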