On Wed, 2009-09-09 at 22:53 +0100, Neil Jerram wrote:
> > It is important. This is one of the problems with the whole Unicode
> > effort. There is no Unicode-capable regex library. The regexp.test
> > tries matching all bytes from 0 to 255, and it uses scm_to_locale_string
> > to prep the string for dispatch to the libc regex calls and
> > scm_from_locale_string to send them back.
[...]
> Thanks for explaining; I think I understand now. So then Ludovic's
> suggestion of with-latin1-locale should work, shouldn't it?

Yeah. I went with that idea.

>
> > This regex library actually can be used with arbitrary Unicode data
> > but it takes extra care. UTF-8 can be used as the locale, and, then
> > regular expression must be written keeping in mind that each non-ASCII
> > character is really a multibyte string.
>
> Can you give an example of what that ("keeping in mind...") means? Is
> it being careful with repetition counts (as in "[a-z]{3}"), for
> example?

I'm not much of a regex guy, but, here's a couple of examples.

First one that sort of works as expected.

guile> (string-match "sé" "José")
==> #("José" (2 . 5))

Regex properly matches the word, but, the match struct (2 . 5) is
referring to the bytes of the string, not the characters of the string.

Here's one that doesn't work as expected.

guile> (string-match "[:lower:]" "Hi, mom")
==> #("Hi, mom" (5 . 6))
guile> (string-match "[:lower:]" "Hí, móm")
==> #f

Once you add accents on the vowels, nothing matches.

Thanks,

Mike
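
A small sketch of the byte-versus-character arithmetic behind the two examples above. It is written in Python rather than Guile so the UTF-8 bytes are explicit. Python's re module is only a stand-in for the libc regex calls: it is not a POSIX regex engine, and the helper names search_bytes and byte_span_to_char_span are made up for illustration. One thing it suggests about the second example: "[:lower:]" on its own is read as a bracket set of the characters ':', 'l', 'o', 'w', 'e', 'r' (the POSIX character class is spelled "[[:lower:]]"), which may account for the match at the 'o' in "Hi, mom" and for the #f once the 'o' becomes 'ó'.

```python
import re


def search_bytes(pattern, text, encoding="utf-8"):
    """Search the encoded bytes of text; return the (start, end) byte span or None."""
    m = re.search(pattern.encode(encoding), text.encode(encoding))
    return m.span() if m else None


def byte_span_to_char_span(text, byte_span, encoding="utf-8"):
    """Translate a byte span in the encoded text into a character span."""
    data = text.encode(encoding)
    start, end = byte_span
    # Decoding the prefix before each offset counts the characters in it.
    # Assumes both offsets fall on character boundaries.
    return (len(data[:start].decode(encoding)),
            len(data[:end].decode(encoding)))


# "é" is two bytes in UTF-8, so the byte span is one longer than the text.
byte_span = search_bytes("sé", "José")
print(byte_span)                                  # (2, 5)
print(byte_span_to_char_span("José", byte_span))  # (2, 4)

# Here "[:lower:]" is a set of the literal characters : l o w e r,
# so the first hit in "Hi, mom" is the 'o'.
print(search_bytes("[:lower:]", "Hi, mom"))       # (5, 6)
# With accented vowels there is no ':', 'l', 'o', 'w', 'e' or 'r' left.
print(search_bytes("[:lower:]", "Hí, móm"))       # None
```

The same prefix-counting trick should carry over to any interface that reports offsets into the encoded string, as long as the offsets land on character boundaries.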