On Aug 23, 2005, at 11:31 PM, Rolland Santimano wrote:

I meant that 'start'/'length' would be UChar counts and probably fall
in the middle of a lead/trail sequence. But as you mentioned below, the
iteration macros should take care of that.

The start/length parameters should always be interpreted as code point counts, so they should never fall in the middle of a surrogate pair.
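
To illustrate the point about code point offsets, here is a minimal sketch in plain C of mapping a code point offset to a UTF-16 code unit index. It is not PHP or ICU code: the helper names (is_lead, is_trail, cp_offset_to_unit_index) are hypothetical, and uint16_t stands in for UChar. ICU's U16_FWD_N macro in unicode/utf16.h serves the same purpose and is probably what real code would use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers for illustration only; not part of PHP or ICU. */
static int is_lead(uint16_t c)  { return (c & 0xFC00) == 0xD800; }
static int is_trail(uint16_t c) { return (c & 0xFC00) == 0xDC00; }

/*
 * Convert an offset counted in code points into an index counted in
 * UTF-16 code units. The result never points at a trail surrogate that
 * belongs to a well-formed pair. Offsets past the end are clamped to len.
 */
static size_t cp_offset_to_unit_index(const uint16_t *s, size_t len,
                                      size_t cp_offset)
{
    size_t i = 0;

    assert(s != NULL);
    while (i < len && cp_offset > 0) {
        /* A lead surrogate followed by a trail surrogate is one code point. */
        if (is_lead(s[i]) && i + 1 < len && is_trail(s[i + 1])) {
            i += 2;
        } else {
            i += 1; /* BMP code point, or an unpaired surrogate */
        }
        cp_offset--;
    }
    return i;
}
```

For the string "a", U+10400, "b" (code units 0061 D801 DC00 0062), a code point offset of 2 maps to code unit index 3, i.e. past the whole surrogate pair rather than into the middle of it.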


That's true about the table - sorry, I wasn't clear the first time
around. I'm thinking of an approach similar to that used in
php_u_trim(), i.e. iterating over both strings. Of course, it's O(mn)
rather than O(m+n).
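
A rough sketch of the "iterate over both strings" idea, assuming UTF-16 input and comparing whole code points rather than code units. This is not the php_u_trim() code; next_cp, is_delim and find_first_delim are hypothetical names, and uint16_t stands in for UChar. For every code point in the subject (m of them) it rescans the delimiter list (n of them), hence O(mn).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode the code point starting at s[*i] and advance *i past it.
 * Unpaired surrogates are returned as-is. Hypothetical helper. */
static uint32_t next_cp(const uint16_t *s, size_t len, size_t *i)
{
    uint32_t c = s[(*i)++];

    if ((c & 0xFC00) == 0xD800 && *i < len && (s[*i] & 0xFC00) == 0xDC00) {
        c = 0x10000 + ((c - 0xD800) << 10) + (s[(*i)++] - 0xDC00);
    }
    return c;
}

/* Linear scan of the delimiter string, one code point at a time. */
static int is_delim(uint32_t cp, const uint16_t *delim, size_t dlen)
{
    size_t j = 0;

    while (j < dlen) {
        if (next_cp(delim, dlen, &j) == cp) {
            return 1;
        }
    }
    return 0;
}

/* Return the code unit index of the first delimiter code point in s,
 * or slen if there is none. */
static size_t find_first_delim(const uint16_t *s, size_t slen,
                               const uint16_t *delim, size_t dlen)
{
    size_t i = 0;

    assert(s != NULL && delim != NULL);
    while (i < slen) {
        size_t start = i;

        if (is_delim(next_cp(s, slen, &i), delim, dlen)) {
            return start;
        }
    }
    return slen;
}
```

Because the comparison is done on decoded code points, a delimiter such as U+10401 does not match U+10400 in the subject even though both share the lead surrogate D801.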

ICU has its own u_strtok_r() function, but its limitation is that it
does not support surrogate pairs (which we should).

Err, the ICU docs & code seem to suggest that they DO handle
surrogates.

So, why not use u_strtok_r() for that, instead of messing with a table?

OK, do we accept combining characters by themselves as delimiters?

I don't see why not.

-Andrei

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
