On Oct 22, 2005, at 2:31 AM, Rolland Santimano wrote:

Ah, right. But I see that context/locale-sensitive full case mapping
is only possible via u_strToLower(), which works on code units. And I
can't find any ICU function that works with code points.

No, u_strToLower() works on code points. The fact that it takes a UChar* parameter doesn't mean that it works on code units.

Should I use simple case mapping for the time being? I'll hold the
implementation of strripos(), str_replace() & str_ireplace() pending some
clarity on this issue. I'll also need to rework stripos() and
stristr() after that.

I've CC'ed Tex to get his input on this.

u_strToLower() will not actually help us with stristr() and others, since:

a) we actually want case folding, not lowercasing
b) we do not want to lowercase both strings and compare them (it's expensive)
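
To illustrate point a), here is a small sketch of the difference between lowercasing and case folding. It uses Python's built-in str.lower() and str.casefold() purely as a stand-in for the Unicode operations; it is not ICU code, and the sample strings are only an assumption chosen for illustration.

```python
# Lowercasing keeps the sharp s as it is; full case folding maps it to "ss".
samples = ["Straße", "STRASSE", "strasse"]

lowered = [s.lower() for s in samples]      # language-neutral lowercasing
folded = [s.casefold() for s in samples]    # Unicode full case folding

# After lowercasing, "straße" and "strasse" still differ.
print(lowered)
# After case folding, all three spellings compare equal.
print(folded)
```

With lowercasing alone, a caseless comparison of "Straße" and "STRASSE" would fail, which is why folding is the operation we want for the str*i*() functions.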

One of the ICU guys had the following suggestion when I asked about a case-insensitive version of u_strstr() a while back:

1. Go one step further and make your string search language-sensitive,
using ICU's string search API (which is based on collation). See
http://icu.sourceforge.net/userguide/searchString.html

2. Use ICU regular expressions. It currently does not handle case
foldings well that map a single character (like ß) to multiple (like
ss).

3. Look at the implementation behind functions like u_strcasecmp() and
try to adapt it to a string search. The implementation case-folds both
strings incrementally. For a search, you would want to case-fold the
pattern beforehand, but not the text in which you are searching.

4. You might try the following: Take the first character in the
pattern and get the set of all characters that have the same case
folding (see the UnicodeSet/USet API). Then search in the string for
the occurrence of any one of the set items (which include strings!).
Then do a case-insensitive comparison, allowing a match that does not
end with the end of the text.
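
A rough sketch of what suggestion 3 could look like: the pattern is case-folded once beforehand, and the text is folded one character at a time only while a candidate match is being checked. This is written in Python with str.casefold() standing in for ICU's case folding; the function name ci_find is hypothetical, and the naive scan is only meant to show the idea, not a tuned implementation.

```python
def ci_find(text, pattern):
    """Return (start, end) indices into text of the first case-insensitive
    match of pattern, or None if there is none.

    Only whole characters of text are matched, so a pattern cannot begin
    or end in the middle of a multi-character folding such as ß -> ss.
    """
    folded_pattern = pattern.casefold()   # fold the pattern once, up front
    if not folded_pattern:
        return (0, 0)
    for start in range(len(text)):
        matched = 0                       # folded pattern characters matched so far
        pos = start
        while pos < len(text) and matched < len(folded_pattern):
            piece = text[pos].casefold()  # fold a single text character
            if not folded_pattern.startswith(piece, matched):
                break
            matched += len(piece)
            pos += 1
        if matched == len(folded_pattern):
            return (start, pos)
    return None


print(ci_find("Die STRASSE", "straße"))
print(ci_find("Die Straße", "STRASSE"))
print(ci_find("Flußtal", "sta"))
```

The returned indices refer to the original text, so the caller can cut or replace the matched span even when the folded and unfolded lengths differ. The scan is quadratic in the worst case; a real implementation would want something smarter for the outer loop.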

The problematic cases are of course those ß->ss and similar. The
collation-based string search API has settings for whether you want to
find "sta" in "Flußtal" and such (the pattern matches the second half
of a text character).
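
The partial-character case can be made concrete with a small sketch. It folds the whole text (which suggestion 3 advises against, so this is for illustration only), remembers which original character each folded character came from, and reports whether a match starts or ends inside one text character. It is plain Python, not the ICU collation-based search; the helper names are hypothetical.

```python
def folded_with_offsets(text):
    """Case-fold text and record, for every folded character, the index of
    the original character that produced it."""
    folded = []
    origin = []
    for i, ch in enumerate(text):
        for f in ch.casefold():
            folded.append(f)
            origin.append(i)
    return "".join(folded), origin


def find_allowing_partial(text, pattern):
    """Return (start, end, partial) for the first caseless match, or None.

    start and end index the original text. partial is True when the match
    begins or ends inside the folding of a single text character.
    """
    needle = pattern.casefold()
    if not needle:
        return None
    folded, origin = folded_with_offsets(text)
    at = folded.find(needle)
    if at < 0:
        return None
    last = at + len(needle) - 1
    starts_inside = at > 0 and origin[at - 1] == origin[at]
    ends_inside = last + 1 < len(folded) and origin[last + 1] == origin[last]
    return origin[at], origin[last] + 1, starts_inside or ends_inside


print(find_allowing_partial("Flußtal", "sta"))
print(find_allowing_partial("Flußtal", "SSTA"))
```

A caller that wants only whole-character matches would reject results where the partial flag is set; a caller that wants the more permissive behaviour would accept them.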

Usually, users seem to be happy to do 1. or 2. Long-term, we would
like to beef up the regex implementation to handle more complicated
case foldings and also canonically equivalent (i.e., normalization)
variants.

I am leaning towards 3 or 4. Not sure which one would be faster, but we would definitely want to write a generic function that does case-insensitive search and reuse it in stristr(), str_ireplace(), and others.

-Andrei

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
