--- Andrei Zmievski <[EMAIL PROTECTED]> wrote:
> On Aug 23, 2005, at 8:23 AM, Rolland Santimano wrote:
> > [1] string substr_replace
> > Impl:
> > The current impl is written in terms of memcpy(), after adjusting
> > start & length correctly. With Unicode input, 'start' & 'length'
> > may not be aligned with codepoint/grapheme boundaries. If args are
> > mixed string types, convert to common type.
>
> What do you mean, they may not be aligned with codepoint boundaries?

I meant that 'start'/'length' would be UChar counts and could fall in
the middle of a lead/trail surrogate sequence. But as you mentioned
below, the iteration macros should take care of that.

> We have to make sure that they are. In order to do this, we need to
> use U16_FWD() macro to iterate through the number of codepoints
> indicated by 'start', and then from that point do the same for
> 'length'. Once that's done you will have the boundaries in terms of
> UChar*'s.

<snip>

> > [3] string strtok([string text, ]string separator)
> > Tokenize string
> > Impl:
> > Current impl uses global state, in the form of char ptrs and a
> > 256-char array. Mixed string type input would be converted to
> > common type, and new global state would have to include initial
> > type of separator. Tokenizing should honor base+combining sequences.
>
> I think we need to flesh out more details here. We can't possibly
> keep a strtok table the size of the entire Unicode set.

That's true about the table - sorry, I wasn't clear the first time
around. I'm thinking of an approach similar to that used in
php_u_trim(), i.e. iterating over both strings: for each codepoint of
the text, scan the separator string for a match. Of course, it's O(mn)
rather than the O(m+n) of the table lookup.

> ICU has its own u_strtok_r() function, but its limitation is that it
> does not support surrogate pairs (which we should).

Err, the ICU docs & code seem to suggest that they DO handle
surrogates.

> As for honoring base+combining sequences, why should strtok() be any
> more special than strstr()?

Ok, do we accept combining characters by themselves as delimiters?

<snip>

Thanks,
Rolland

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
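
For reference, a minimal sketch of the boundary adjustment discussed
for substr_replace: walk 'start' codepoints from the beginning, then
'length' codepoints from there, and end up with offsets in UTF-16 code
units. It is deliberately free of ICU headers so it compiles on its
own. The names u16unit, fwd_codepoints() and cp_range_to_units() are
made up for this illustration (u16unit stands in for ICU's UChar), and
the hand-written stepping only approximates what ICU's U16_FWD_N macro
does. Negative 'start'/'length' values, which substr_replace also
accepts, are not handled here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for ICU's UChar (one UTF-16 code unit). */
typedef uint16_t u16unit;

static int is_lead(u16unit c)  { return (c & 0xFC00) == 0xD800; }
static int is_trail(u16unit c) { return (c & 0xFC00) == 0xDC00; }

/*
 * Advance from code-unit index i by up to n codepoints, never moving
 * past len. A lead surrogate followed by a trail surrogate counts as
 * one codepoint; an unpaired surrogate counts as one codepoint too.
 * Returns the new code-unit index.
 */
static size_t fwd_codepoints(const u16unit *s, size_t i, size_t len, size_t n)
{
    while (n > 0 && i < len) {
        if (is_lead(s[i]) && i + 1 < len && is_trail(s[i + 1])) {
            i += 2;   /* surrogate pair: skip both units */
        } else {
            i += 1;   /* BMP codepoint or unpaired surrogate */
        }
        n--;
    }
    return i;
}

/*
 * Convert a (start, length) pair given in codepoints into the
 * half-open code-unit range [*from, *to), clamped to the string.
 */
static void cp_range_to_units(const u16unit *s, size_t len,
                              size_t start, size_t length,
                              size_t *from, size_t *to)
{
    *from = fwd_codepoints(s, 0, len, start);
    *to   = fwd_codepoints(s, *from, len, length);
}

/* Small self-check on "a", U+1D11E (a surrogate pair), "b", "c". */
void substr_bounds_demo(void)
{
    const u16unit s[] = { 0x0061, 0xD834, 0xDD1E, 0x0062, 0x0063 };
    size_t from, to;

    /* Codepoint 2 is "b"; code unit 2 would be the trail surrogate. */
    cp_range_to_units(s, 5, 2, 1, &from, &to);
    assert(from == 3 && to == 4);
}
```

Once the range is known in code units, the memcpy()-based splice can
stay as it is: copy [0, from), then the replacement, then [to, len).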