On 14 Oct 2014, at 19:01, Rowan Collins <rowan.coll...@gmail.com> wrote:

> 
>> If you want to see a pragmatic, actually working, work-in-progress attempt 
>> at better PHP unicode support, see this: https://github.com/krakjoe/ustring
> 
> It looks like a good prototype, but glancing at the documentation, I'm not 
> clear exactly what the assumptions of some of the functions are.
> 
> There's a lot of talk of "characters", which is a *very* slippery notion in 
> Unicode; charAt() returns a single code point, and $length returns a number 
> of code points. This makes me wonder if it will pass "the noël test" [1] - 
> does a combining diacritic move onto a different letter when you run 
> ->reverse()?
> 
> As I've mentioned before, a lot of the time what people actually want to deal 
> with is "grapheme clusters" - the kind of thing that you'd think of as a 
> character if you were writing by hand. Most people, if asked the length of 
> the string "noël", would answer 4, but there may be 5 code points. (That's 
> not just a case of normalisation choices; most combinations of 
> letter+diacritic have no single code point, that's why the combining forms 
> exist.)
> 
> A good Unicode string API should probably give clear labels and choices for 
> such things - $string->codePointAt(3) is not the same as 
> $string->graphemeAt(3), $string->codePointCount is not the same as 
> $string->graphemeCount, and so forth. A single property $length seems more 
> user-friendly, until the user finds it means something different to what they 
> wanted.

This is true. The documentation ought to say "code points" but doesn’t. Length
is primarily needed for iterating through strings and the like. If you want the
length in characters, you probably need to implement your own algorithm, as it
really depends on your specific use case.
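
To make the distinction concrete, here is a minimal sketch using PHP's stock
extensions rather than ustring itself. It assumes mbstring and intl are loaded,
and uses the decomposed spelling of "noël" (an "e" followed by U+0308 COMBINING
DIAERESIS) as its input.

```php
<?php
// Assumes the mbstring and intl extensions are loaded.
// "noe" + U+0308 COMBINING DIAERESIS (bytes CC 88) + "l".
$s = "noe\xCC\x88l";

$bytes      = strlen($s);              // counts UTF-8 bytes
$codePoints = mb_strlen($s, 'UTF-8');  // counts Unicode code points
$graphemes  = grapheme_strlen($s);     // counts grapheme clusters

printf("%d bytes, %d code points, %d graphemes\n", $bytes, $codePoints, $graphemes);
// prints "6 bytes, 5 code points, 4 graphemes"
```

Which of the three counts is "the length" depends on whether you are sizing a
buffer, iterating, or showing a number to a user.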

It will, however, always produce valid UTF-8 strings for output. That’s better
than the standard string functions, which can mangle UTF-8.
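
As an illustration of that mangling, the sketch below compares a byte-wise
reversal with a code-point reversal. It is not ustring's implementation: the
code-point version is a stand-in built on preg_split() with the /u modifier,
and it assumes mbstring is loaded for mb_check_encoding().

```php
<?php
// Assumes the mbstring extension is loaded.
// "\xC3\xAB" is the precomposed "ë" (U+00EB) in UTF-8.
$s = "no\xC3\xABl";

// Byte-wise reversal swaps the two bytes of "ë", producing invalid UTF-8.
$byteReversed = strrev($s);

// Code-point reversal: split into code points, reverse, rejoin.
$codePoints = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
$cpReversed = implode('', array_reverse($codePoints));

var_dump(mb_check_encoding($byteReversed, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($cpReversed, 'UTF-8'));   // bool(true)
```

Note that a code-point reversal only guarantees valid UTF-8. With the
decomposed spelling ("e" followed by U+0308) the combining mark would still end
up attached to the "l", which is exactly the noël test quoted above.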

> Similarly, an automatic __toString() function is handy, but what encoding 
> does it output, and why? UTF-8? The same encoding that the string was 
> constructed with?

Always UTF-8.

> If I know that my database is expecting UTF-8, I probably want to say 
> $string->getByteString('UTF-8').

You can do that.

> I may also want to say $string->getByteStringWithMaxLength('UTF-8', 20) to 
> fit an exact number of graphemes into a 20-byte binary space; something that 
> neither $string->substring(0, 20)->getByteString('UTF-8') nor substr( 
> $string->getByteString('UTF-8'), 0, 20 ) can do.

I’m not sure quite how you’d do that. There might be a function in mbstring for 
that.
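
For what it is worth, a rough sketch of the byte-limited cut with the existing
extensions: mb_strcut() from mbstring cuts on a code point boundary, while
grapheme_extract() from intl, called with GRAPHEME_EXTR_MAXBYTES, returns only
whole grapheme clusters. The helper name truncateToBytes() is hypothetical, and
the limit of 4 bytes is chosen only to keep the example small. The type
declarations need PHP 7 or later.

```php
<?php
// Assumes the intl and mbstring extensions are loaded.

// Hypothetical helper: the longest prefix of $s that fits in $maxBytes bytes
// of UTF-8 and ends on a grapheme cluster boundary.
function truncateToBytes(string $s, int $maxBytes): string
{
    $prefix = grapheme_extract($s, $maxBytes, GRAPHEME_EXTR_MAXBYTES);
    return $prefix === false ? '' : $prefix;
}

$s = "noe\xCC\x88l"; // decomposed "noël": 6 bytes, 5 code points, 4 graphemes

$raw         = substr($s, 0, 4);             // cuts U+0308 in half: invalid UTF-8
$byCodePoint = mb_strcut($s, 0, 4, 'UTF-8'); // "noe": valid, but the diaeresis is dropped
$byGrapheme  = truncateToBytes($s, 4);       // "no": the 3-byte "ë" cluster does not fit

var_dump(mb_check_encoding($raw, 'UTF-8')); // bool(false)
var_dump($byCodePoint, $byGrapheme);
```

Whether dropping the whole cluster is the right answer still depends on the
use case, but it does avoid silently turning "ë" into "e".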

> In short, we can only abstract so much - supporting Unicode automatically 
> means supporting its complexity, not just pretending it's a really big 
> version of ASCII.

Sure. But just handling code points safely is hard enough as it is. This 
handles that. It doesn’t handle characters, sure, but it’s a start. And for 
many applications, you do not need to handle characters.
--
Andrea Faulds
http://ajf.me/

--
PHP Internals - PHP Runtime Development Mailing List