--- Andrei Zmievski <[EMAIL PROTECTED]> wrote:
> On Aug 23, 2005, at 8:23 AM, Rolland Santimano wrote:
> > [1] string substr_replace
> > Impl:
> > The current impl is written in terms of memcpy(), after adjusting
> > start & length correctly. With Unicode input, 'start' & 'length'
> > may not be aligned with codepoint/grapheme boundaries. If args are
> > mixed string types, convert to common type.
> 
> What do you mean, they may not be aligned with codepoint boundaries?

I meant that 'start'/'length' would be UChar counts and probably fall
in the middle of a lead/trail sequence. But as you mentioned below, the
iteration macros should take care of that.

> We have to make sure that they are. In order to do this, we need to
> use U16_FWD() macro to iterate through the number of codepoints
> indicated by 'start', and then from that point do the same for
> 'length'. Once that's done you will have the boundaries in terms of
> UChar*'s.
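
A minimal sketch of the boundary computation described above, for illustration only. It assumes UTF-16 input and does not use ICU: fwd_n() is a hypothetical stand-in for what ICU's U16_FWD_N() macro does, and cp_range_to_units() is a hypothetical helper name, not an existing PHP function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Advance *i past up to n code points in s[0..len). A lead surrogate
 * followed by a trail surrogate counts as one code point, so the index
 * never stops in the middle of a pair. (Hypothetical helper.) */
void fwd_n(const uint16_t *s, size_t *i, size_t len, size_t n)
{
    assert(s != NULL && i != NULL);
    while (n > 0 && *i < len) {
        uint16_t c = s[(*i)++];
        if ((c & 0xFC00) == 0xD800 && *i < len &&
            (s[*i] & 0xFC00) == 0xDC00) {
            (*i)++; /* consume the trail surrogate as well */
        }
        n--;
    }
}

/* Convert a 'start'/'length' pair given in code points into the
 * code-unit range [*from, *to) within s. (Hypothetical helper.) */
void cp_range_to_units(const uint16_t *s, size_t len,
                       size_t cp_start, size_t cp_length,
                       size_t *from, size_t *to)
{
    size_t i = 0;
    assert(from != NULL && to != NULL);
    fwd_n(s, &i, len, cp_start);   /* walk 'start' code points */
    *from = i;
    fwd_n(s, &i, len, cp_length);  /* then 'length' more from there */
    *to = i;
}
```

Once the range is known in code units, the existing memcpy()-based replacement logic can operate on it unchanged. Negative 'start'/'length' values, which substr_replace() also accepts, are left out of this sketch.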

<snip>

> > [3] string strtok([string text, ]string separator)
> > Tokenize string
> > Impl:
> > Current impl uses global state, in the form of char ptrs and a
> > 256-char array. Mixed string type input would be converted to
> > common type, and new global state would have to include initial
> > type of separator. Tokenizing should honor base+combining
> > sequences.
> 
> I think we need to flesh out more details here. We can't possibly
> keep a strtok table the size of the entire Unicode set.

That's true about the table - sorry, I wasn't clear the first time
around. I'm thinking of an approach similar to that used in
php_u_trim(), i.e. iterating over both strings. Of course, it's O(mn)
rather than O(m+n).
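
A rough sketch of the table-free approach, for illustration only. It is not the php_u_trim() code. It assumes UTF-16 input, and all names (next_cp, is_delim, next_token) are hypothetical. Each code point of the text is compared against every code point of the separator string, which is where the O(mn) cost comes from. State is kept in a caller-supplied position rather than in globals.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one code point at s[*i] (requires *i < len) and advance *i.
 * An unpaired surrogate is returned as its own value. */
uint32_t next_cp(const uint16_t *s, size_t *i, size_t len)
{
    uint32_t c;
    assert(*i < len);
    c = s[(*i)++];
    if ((c & 0xFC00) == 0xD800 && *i < len && (s[*i] & 0xFC00) == 0xDC00) {
        uint32_t trail = s[(*i)++];
        c = 0x10000 + ((c - 0xD800) << 10) + (trail - 0xDC00);
    }
    return c;
}

/* Linear scan of the separator string: O(n) per text code point. */
int is_delim(uint32_t cp, const uint16_t *delim, size_t dlen)
{
    size_t j = 0;
    while (j < dlen) {
        if (next_cp(delim, &j, dlen) == cp) {
            return 1;
        }
    }
    return 0;
}

/* Find the next token in text starting at *pos. On success returns 1 and
 * sets the code-unit range [*tok_from, *tok_to); returns 0 when no token
 * is left. *pos is updated so the call can be repeated. */
int next_token(const uint16_t *text, size_t len, size_t *pos,
               const uint16_t *delim, size_t dlen,
               size_t *tok_from, size_t *tok_to)
{
    size_t i = *pos, before;

    while (i < len) {                      /* skip leading delimiters */
        before = i;
        if (!is_delim(next_cp(text, &i, len), delim, dlen)) {
            i = before;
            break;
        }
    }
    if (i >= len) {
        *pos = len;
        return 0;
    }
    *tok_from = i;
    while (i < len) {                      /* scan to the next delimiter */
        before = i;
        if (is_delim(next_cp(text, &i, len), delim, dlen)) {
            *tok_to = before;
            *pos = i;
            return 1;
        }
    }
    *tok_to = len;
    *pos = len;
    return 1;
}
```

Because matching is done per code point, a supplementary character in the separator string behaves as a single delimiter. A combining mark in the separator string would also match on its own, so this sketch does not treat base+combining sequences specially.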

> ICU has its own u_strtok_r() function, but its limitation is that it
> does not support surrogate pairs (which we should).

Err, the ICU docs & code seem to suggest that they DO handle
surrogates.

> As for honoring base+combining sequences, why should strtok() be any
> more special than strstr()?

OK, do we accept combining characters by themselves as delimiters?

<snip>

Thanks,
Rolland

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
