On Aug 23, 2005, at 11:31 PM, Rolland Santimano wrote:
I meant that 'start'/'length' would be UChar counts and probably fall
in the middle of a lead/trail sequence. But as you mentioned below, the
iteration macros should take care of that.
start/length parameters should always be interpreted in codepoint
context, so they should never fall in the middle of a surrogate pair.
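
To illustrate the codepoint interpretation of start/length, here is a minimal sketch that maps a (start, length) pair given in code points to a range of UTF-16 code units. The helper names (is_lead, is_trail, fwd_n, cp_range_to_units) are hypothetical and written out by hand so the example is self-contained; in real code ICU's U16_IS_LEAD, U16_IS_TRAIL and U16_FWD_N macros would presumably do this job.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for ICU's U16_IS_LEAD / U16_IS_TRAIL. */
static int is_lead(uint16_t c)  { return (c & 0xFC00) == 0xD800; }
static int is_trail(uint16_t c) { return (c & 0xFC00) == 0xDC00; }

/* Advance from code-unit index i by n code points. A lead unit followed
 * by a trail unit counts as one code point, so the result never lands
 * between the two halves of a surrogate pair. Clamped to len. */
static size_t fwd_n(const uint16_t *s, size_t i, size_t len, size_t n)
{
    while (n > 0 && i < len) {
        if (is_lead(s[i]) && i + 1 < len && is_trail(s[i + 1])) {
            i += 2;   /* supplementary code point: two code units */
        } else {
            i += 1;   /* BMP code point or unpaired surrogate */
        }
        n--;
    }
    return i;
}

/* Map (cp_start, cp_length), both counted in code points, to the
 * half-open code-unit range [*begin, *end). */
static void cp_range_to_units(const uint16_t *s, size_t len,
                              size_t cp_start, size_t cp_length,
                              size_t *begin, size_t *end)
{
    *begin = fwd_n(s, 0, len, cp_start);
    *end   = fwd_n(s, *begin, len, cp_length);
}
```

With this interpretation, start=1 and length=1 on the string 'a', U+10400, 'b' selects both code units of U+10400 rather than half of the pair.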
That's true about the table - sorry, I wasn't clear the first time
around. I'm thinking of an approach similar to that used in
php_u_trim(), i.e. iterating over both strings. Of course, it's O(mn)
rather than O(m+n).
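
A rough sketch of the "iterate over both strings" idea, comparing whole code points so that a supplementary delimiter only matches the same supplementary character. This is not the actual php_u_trim() code; next_cp, is_delim and skip_delims are hypothetical names, and next_cp is a hand-written stand-in for what ICU's U16_NEXT macro would do.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode the code point starting at s[*i] and advance *i past it
 * (one or two code units). An unpaired surrogate is returned as-is. */
static uint32_t next_cp(const uint16_t *s, size_t *i, size_t len)
{
    uint32_t c = s[(*i)++];
    if ((c & 0xFC00) == 0xD800 && *i < len && (s[*i] & 0xFC00) == 0xDC00) {
        c = 0x10000 + ((c - 0xD800) << 10) + (s[(*i)++] - 0xDC00);
    }
    return c;
}

/* Linear scan of the delimiter string, one code point at a time. */
static int is_delim(uint32_t cp, const uint16_t *delims, size_t dlen)
{
    size_t j = 0;
    while (j < dlen) {
        if (next_cp(delims, &j, dlen) == cp) {
            return 1;
        }
    }
    return 0;
}

/* Return the code-unit index of the first code point in s that is not
 * a delimiter. Each code point of s is checked against every code
 * point of delims, hence O(m*n). */
static size_t skip_delims(const uint16_t *s, size_t len,
                          const uint16_t *delims, size_t dlen)
{
    size_t i = 0;
    while (i < len) {
        size_t next = i;
        uint32_t cp = next_cp(s, &next, len);
        if (!is_delim(cp, delims, dlen)) {
            break;
        }
        i = next;
    }
    return i;
}
```

A lookup table would bring this down to O(m+n), but the nested scan needs no extra storage and handles supplementary delimiters without special cases.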
ICU has its own u_strtok_r() function, but its limitation is that it
does not support surrogate pairs (which we should).
Err, the ICU docs & code seem to suggest that they DO handle
surrogates.
So, why not use u_strtok_r() for that, instead of messing with a table?
OK, do we accept combining characters by themselves as delimiters?
I don't see why not.
-Andrei
--
PHP Internals - PHP Runtime Development Mailing List