On Aug 23, 2005, at 8:23 AM, Rolland Santimano wrote:
[1] string substr_replace(string original, string new, int start[, int
length])
Returns string where original[start..length] is replaced with
new. Input args can be arrays, in which case the operation is:
substr_replace(original[i], new[i], start[i], length[i])
Impl:
The current impl is written in terms of memcpy(), after adjusting
start & length correctly. With Unicode input, 'start' & 'length' may
not be aligned with codepoint/grapheme boundaries. If args are mixed
string types, convert to common type.

What do you mean, they may not be aligned with codepoint boundaries? We have to make sure that they are. To do this, we need to use ICU's U16_FWD_N() macro to step forward by the number of codepoints indicated by 'start', and then from that point do the same for 'length'. Once that's done, you will have the boundaries in terms of UChar* pointers.
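To make the offset translation concrete, here is a rough sketch in plain C with no ICU dependency. The names skip_codepoints() and cp_range_to_units() are made up for illustration and are not part of PHP or ICU; the loop is only meant to mirror what a forward-by-N-codepoints step does on a UTF-16 buffer.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a PHP or ICU function): starting at code-unit
 * index i, step forward over n code points in a UTF-16 buffer of len code
 * units. A lead surrogate followed by a trail surrogate counts as one code
 * point, so the result never lands inside a pair. Clamped to len. */
static size_t skip_codepoints(const uint16_t *s, size_t i, size_t len, size_t n)
{
    while (n > 0 && i < len) {
        uint16_t c = s[i++];
        if (c >= 0xD800 && c <= 0xDBFF && i < len &&
            s[i] >= 0xDC00 && s[i] <= 0xDFFF) {
            i++; /* consume the trail surrogate as well */
        }
        n--;
    }
    return i;
}

/* Hypothetical helper: turn (start, length) counted in code points into a
 * half-open [begin, end) range counted in code units. */
static void cp_range_to_units(const uint16_t *s, size_t len,
                              size_t start, size_t length,
                              size_t *begin, size_t *end)
{
    *begin = skip_codepoints(s, 0, len, start);
    *end   = skip_codepoints(s, *begin, len, length);
}
```

With the boundaries expressed in code units, the existing memcpy()-based logic can stay as it is, only multiplied by sizeof(UChar). Negative 'start'/'length' values would need a backward walk, which this sketch leaves out.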

[2] int substr_count(string text, string token[, int start[, int
length]])
Returns no of occurrences of token in text[start..length]
Impl:
The current impl is around php_memnstr() and can be extended for
Unicode with zend_u_memnstr()

Same thing with regard to start and length applies here.
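As an illustration only, counting inside a codepoint-based range could look like the sketch below. It uses a naive memcmp() scan instead of zend_u_memnstr(), and both function names are hypothetical. Matching on raw code units is safe for well-formed UTF-16 because lead and trail surrogate ranges do not overlap.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: step forward over n code points from code-unit
 * index i without stopping inside a surrogate pair. */
static size_t skip_codepoints(const uint16_t *s, size_t i, size_t len, size_t n)
{
    while (n > 0 && i < len) {
        uint16_t c = s[i++];
        if (c >= 0xD800 && c <= 0xDBFF && i < len &&
            s[i] >= 0xDC00 && s[i] <= 0xDFFF) {
            i++;
        }
        n--;
    }
    return i;
}

/* Hypothetical sketch: count non-overlapping occurrences of tok inside the
 * part of text selected by (start, length), both given in code points. */
static size_t count_in_cp_range(const uint16_t *text, size_t text_len,
                                const uint16_t *tok, size_t tok_len,
                                size_t start, size_t length)
{
    size_t begin = skip_codepoints(text, 0, text_len, start);
    size_t end   = skip_codepoints(text, begin, text_len, length);
    size_t i = begin, count = 0;

    if (tok_len == 0) {
        return 0; /* an empty token is treated as "no matches" here */
    }
    while (i + tok_len <= end) {
        if (memcmp(text + i, tok, tok_len * sizeof(uint16_t)) == 0) {
            count++;
            i += tok_len; /* continue after the match, no overlaps */
        } else {
            i++;
        }
    }
    return count;
}
```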

[3] string strtok([string text, ]string separator)
Tokenize string
Impl:
Current impl uses global state, in the form of char ptrs and a
256-char array. Mixed string type input would be converted to common
type, and new global state would have to include initial type of
separator. Tokenizing should honor base+combining sequences.

I think we need to flesh out more details here. We can't possibly keep a strtok table the size of the entire Unicode set. ICU has its own u_strtok_r() function, but its limitation is that it does not support surrogate pairs (which we should). As for honoring base+combining sequences, why should strtok() be any more special than strstr()?
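One possible direction, sketched under my own assumptions: drop the lookup table, keep the separator string itself, and test each decoded codepoint of the text against it. The state lives in a struct instead of globals. All names here (next_cp, is_delim, u16_tok_state, u16_next_token) are hypothetical, and the linear scan of the separator is a simplicity-over-speed choice.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode the code point at s[*i] and advance *i past it (1 or 2 units).
 * An unpaired surrogate is returned as its own value. */
static uint32_t next_cp(const uint16_t *s, size_t *i, size_t len)
{
    uint32_t c = s[(*i)++];
    if (c >= 0xD800 && c <= 0xDBFF && *i < len &&
        s[*i] >= 0xDC00 && s[*i] <= 0xDFFF) {
        c = 0x10000 + ((c - 0xD800) << 10) + (s[*i] - 0xDC00);
        (*i)++;
    }
    return c;
}

/* Return 1 if cp occurs among the code points of delim, else 0. */
static int is_delim(uint32_t cp, const uint16_t *delim, size_t delim_len)
{
    size_t j = 0;
    while (j < delim_len) {
        if (next_cp(delim, &j, delim_len) == cp) {
            return 1;
        }
    }
    return 0;
}

/* Hypothetical tokenizer state, standing in for the current globals. */
typedef struct {
    const uint16_t *text;
    size_t len; /* length in code units */
    size_t pos; /* where the next call resumes */
} u16_tok_state;

/* Find the next token. On success returns 1 and stores its half-open
 * code-unit range in [*tok_begin, *tok_end); returns 0 when exhausted. */
static int u16_next_token(u16_tok_state *st,
                          const uint16_t *delim, size_t delim_len,
                          size_t *tok_begin, size_t *tok_end)
{
    size_t i = st->pos, prev;

    while (i < st->len) { /* skip leading separators */
        prev = i;
        if (!is_delim(next_cp(st->text, &i, st->len), delim, delim_len)) {
            i = prev;
            break;
        }
    }
    if (i >= st->len) {
        st->pos = st->len;
        return 0;
    }
    *tok_begin = i;
    while (i < st->len) { /* scan to the next separator or the end */
        prev = i;
        if (is_delim(next_cp(st->text, &i, st->len), delim, delim_len)) {
            *tok_end = prev;
            st->pos = i; /* resume after the separator */
            return 1;
        }
    }
    *tok_end = st->len;
    st->pos = st->len;
    return 1;
}
```

This handles separators outside the BMP because both text and separator are compared as whole codepoints. It does nothing special for base+combining sequences.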

[5] string str_pad(string text, int length[, string pad[, int
pad_type]])
Returns input string padded on the left and/or right (determined by
pad_type) to specified length with pad string.
Impl:
The impl builds the output string by copying appropriate pad
characters to the left and/or right of the input string.
Q: With STR_PAD_BOTH, let's say 'length' == input 'text' length + 2
(lengths in UChars), but 'pad' text is non-BMP (ie. 2 UChars), then
the 'pad' text can't be added at either end. More generally, the 'pad'
text can't be split in the middle of non-BMP codepts or base+combining
sequences. If such a condn occurs, an error should be returned. Any
other thoughts ?

We should not split padding strings in the middle of surrogate pairs. As for combining sequences, I would defer to Tex and see what he has to say. The input length parameter should indicate the number of codepoints to pad to, not the number of UChars.
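A minimal sketch of what codepoint-based padding might look like, right-hand side only. The helpers cp_width(), count_codepoints() and pad_right_cp() are invented for this example. The pad string is copied one whole codepoint at a time, so a surrogate pair is never cut; combining sequences are not considered.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Width in code units (1 or 2) of the code point starting at s[i]. */
static size_t cp_width(const uint16_t *s, size_t i, size_t len)
{
    return (s[i] >= 0xD800 && s[i] <= 0xDBFF && i + 1 < len &&
            s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) ? 2 : 1;
}

/* Number of code points in a UTF-16 buffer of len code units. */
static size_t count_codepoints(const uint16_t *s, size_t len)
{
    size_t i = 0, n = 0;
    while (i < len) {
        i += cp_width(s, i, len);
        n++;
    }
    return n;
}

/* Hypothetical sketch: copy text into out and pad on the right, cycling
 * through pad, until the result holds target_cp code points. Returns 0 and
 * sets *out_len (in code units) on success, -1 if out_cap is too small. */
static int pad_right_cp(const uint16_t *text, size_t text_len,
                        const uint16_t *pad, size_t pad_len,
                        size_t target_cp,
                        uint16_t *out, size_t out_cap, size_t *out_len)
{
    size_t have = count_codepoints(text, text_len);
    size_t o = text_len, p = 0;

    if (text_len > out_cap) {
        return -1;
    }
    memcpy(out, text, text_len * sizeof(uint16_t));
    while (pad_len > 0 && have < target_cp) {
        size_t w = cp_width(pad, p, pad_len);
        if (o + w > out_cap) {
            return -1;
        }
        memcpy(out + o, pad + p, w * sizeof(uint16_t)); /* whole code point */
        o += w;
        p += w;
        if (p >= pad_len) {
            p = 0; /* wrap around to the start of the pad string */
        }
        have++;
    }
    *out_len = o;
    return 0;
}
```

Because the target is counted in codepoints, the STR_PAD_BOTH case from the question stops being an error: a request for two more codepoints gets one non-BMP pad character on each side, and only the buffer size in UChars grows.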

[7] int levenshtein(string str1, string str2[, int ins_cost, int
rep_cost, int del_cost])
Calculate Levenshtein distance between str1 & str2.

Q: Any gotchas in extending the Levenshtein algo for Unicode ? Should
the ins/del/subst cost be expressed in graphemes or codepts ?

I think it can be fairly easily extended to Unicode strings, since the algorithm only cares about the insertion, deletion, or substitution of characters. We should once again work on the codepoint level.
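For illustration, a codepoint-level version could decode both inputs to 32-bit codepoints first and then run the usual two-row dynamic programming unchanged. This is a sketch, not the existing PHP code; decode_u16(), min3() and u16_levenshtein() are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Decode UTF-16 into code points; out must hold at least len entries.
 * Returns the number of code points written. */
static size_t decode_u16(const uint16_t *s, size_t len, uint32_t *out)
{
    size_t i = 0, n = 0;
    while (i < len) {
        uint32_t c = s[i++];
        if (c >= 0xD800 && c <= 0xDBFF && i < len &&
            s[i] >= 0xDC00 && s[i] <= 0xDFFF) {
            c = 0x10000 + ((c - 0xD800) << 10) + (s[i] - 0xDC00);
            i++;
        }
        out[n++] = c;
    }
    return n;
}

static int min3(int a, int b, int c)
{
    int m = a < b ? a : b;
    return m < c ? m : c;
}

/* Cost of turning a into b, counted per code point. Returns -1 if
 * memory allocation fails. */
static int u16_levenshtein(const uint16_t *a, size_t alen,
                           const uint16_t *b, size_t blen,
                           int ins_cost, int rep_cost, int del_cost)
{
    uint32_t *ca = malloc((alen + 1) * sizeof *ca);
    uint32_t *cb = malloc((blen + 1) * sizeof *cb);
    int *prev = malloc((blen + 1) * sizeof *prev);
    int *cur  = malloc((blen + 1) * sizeof *cur);
    int result = -1;

    if (ca && cb && prev && cur) {
        size_t na = decode_u16(a, alen, ca);
        size_t nb = decode_u16(b, blen, cb);
        size_t i, j;

        for (j = 0; j <= nb; j++) {
            prev[j] = (int)j * ins_cost; /* empty prefix of a: insert all */
        }
        for (i = 1; i <= na; i++) {
            int *tmp;
            cur[0] = (int)i * del_cost; /* empty prefix of b: delete all */
            for (j = 1; j <= nb; j++) {
                int sub = prev[j - 1] + (ca[i - 1] == cb[j - 1] ? 0 : rep_cost);
                cur[j] = min3(sub, prev[j] + del_cost, cur[j - 1] + ins_cost);
            }
            tmp = prev; prev = cur; cur = tmp; /* newest row is now prev */
        }
        result = prev[nb];
    }
    free(ca); free(cb); free(prev); free(cur);
    return result;
}
```

The difference from a UChar-level comparison shows up with non-BMP input: replacing U+1D11E with 'b' is a single substitution here, whereas comparing code units would charge for two edits.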

The foll funcns generally work on ASCII input, and should be made
Unicode-aware. However, should they be converted to process Unicode
input ?

[1] string addslashes(string text)
[2] string stripslashes(string text)
Escape single/double quotes & backslashes with backslashes

I don't see any problems with these two.

[3] string addcslashes(string text, string charlist)
[4] string stripcslashes(string text)
Escape chars < 32 or > 126 with octal sequences, and escape characters
from charlist with backslash.

Same here.

[5] string strip_tags(string text[, string allowed_tags])
Strip HTML/PHP tags from text

Should be ok, but I think we'll end up duplicating a large chunk of code.

-Andrei

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
