On 14/10/2014 14:50, Andrea Faulds wrote:
> 2. What is currently missing in that regard?
> Unicode string support.

I know that was probably deliberately flippant, but I think there is a genuine question to be asked here. A lot of people talk about "Unicode support" the way they talk about "XPath support", but XPath is a single specification you can adhere to; Unicode is a whole lot more (and less) than that.

What it probably means to most people is "string functions which do what I expect with a vast range of obscure Unicode code point sequences". Those expectations need to be documented *before* an API is written, rather than writing a whole load of functions which use a Unicode library, but don't actually provide the tools that people need.

> If you want to see a pragmatic, actually working, work-in-progress attempt at
> better PHP unicode support, see this: https://github.com/krakjoe/ustring

It looks like a good prototype, but glancing at the documentation, I'm not clear exactly what the assumptions of some of the functions are.

There's a lot of talk of "characters", which is a *very* slippery notion in Unicode; charAt() returns a single code point, and $length holds a count of code points. This makes me wonder if it will pass "the noël test" [1] - does a combining diacritic move onto a different letter when you run ->reverse()?
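To show what I mean by the reverse test, here is a rough sketch. It is in Python rather than PHP, purely because Python strings index by code point, and it has nothing to do with how ustring is implemented. The simple_graphemes() helper is my own invention and a deliberate simplification: it only glues combining marks onto the preceding letter, which is far less than real grapheme cluster segmentation (UAX #29) requires.

```python
import unicodedata

def simple_graphemes(text):
    """Group each base character with the combining marks that follow it.

    Hypothetical helper; a simplified stand-in for real grapheme
    cluster segmentation, which has many more rules than this.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch   # attach the mark to the previous cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters

noel = "noe\u0308l"  # 'e' followed by U+0308 COMBINING DIAERESIS

# Reversing code points moves the diaeresis onto the 'l'.
by_code_point = noel[::-1]

# Reversing clusters keeps the diaeresis on the 'e'.
by_grapheme = "".join(reversed(simple_graphemes(noel)))

print([hex(ord(c)) for c in by_code_point])
print([hex(ord(c)) for c in by_grapheme])
```

The first result is "l", U+0308, "e", "o", "n", which renders with the dots over the "l"; the second is "l", "e", U+0308, "o", "n", which is what a user would call "noël" backwards.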

As I've mentioned before, a lot of the time what people actually want to deal with is "grapheme clusters" - the kind of thing that you'd think of as a character if you were writing by hand. Most people, if asked the length of the string "noël", would answer 4, but if the "ë" is stored as "e" followed by U+0308 COMBINING DIAERESIS, there are 5 code points. (That's not just a case of normalisation choices; most combinations of letter+diacritic have no single code point, which is why the combining forms exist.)
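For anyone who wants to check those numbers, a small sketch follows (again Python, only for illustration). It uses the standard unicodedata module; the choice of "n" plus diaeresis as a pair with no precomposed code point is my own example.

```python
import unicodedata

decomposed = "noe\u0308l"                            # e + U+0308
composed = unicodedata.normalize("NFC", decomposed)  # single U+00EB for the e-diaeresis

print(len(decomposed))  # 5 code points
print(len(composed))    # 4 code points

# Normalisation cannot always help: "n" + U+0308 has no precomposed
# form, so it stays as two code points even after NFC.
n_diaeresis = unicodedata.normalize("NFC", "n\u0308")
print(len(n_diaeresis))  # 2 code points, 1 grapheme

# Byte counts differ again, depending on the form and the encoding.
print(len(decomposed.encode("utf-8")))  # 6 bytes
print(len(composed.encode("utf-8")))    # 5 bytes
```

So "length" has at least three plausible answers for the same visible text: 4 graphemes, 4 or 5 code points, and 5 or 6 bytes in UTF-8.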

A good Unicode string API should probably give clear labels and choices for such things - $string->codePointAt(3) is not the same as $string->graphemeAt(3), $string->codePointCount is not the same as $string->graphemeCount, and so forth. A single property $length seems more user-friendly, until the user finds it means something different to what they wanted.
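A minimal sketch of what that explicit labelling might look like, in Python for illustration only. The class and every method name here are hypothetical, not ustring's API, and the grapheme splitting is the same crude "letter plus combining marks" approximation rather than a proper implementation.

```python
import unicodedata

class UString:
    """Hypothetical string wrapper with explicitly labelled units."""

    def __init__(self, text):
        self._text = text
        # Simplified clustering: base character + following combining marks.
        self._clusters = []
        for ch in text:
            if self._clusters and unicodedata.combining(ch):
                self._clusters[-1] += ch
            else:
                self._clusters.append(ch)

    @property
    def code_point_count(self):
        return len(self._text)

    @property
    def grapheme_count(self):
        return len(self._clusters)

    def code_point_at(self, index):
        return self._text[index]

    def grapheme_at(self, index):
        return self._clusters[index]

s = UString("noe\u0308l")
print(s.code_point_count, s.grapheme_count)  # 5 4
print(hex(ord(s.code_point_at(3))))          # 0x308, the bare diaeresis
print(s.grapheme_at(3))                      # l
```

With index 3, the code point version hands back a lone combining mark and the grapheme version hands back "l"; neither is wrong, but the caller has to say which one they mean.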

Similarly, an automatic __toString() function is handy, but what encoding does it output, and why? UTF-8? The same encoding that the string was constructed with?
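To make the encoding question concrete, here is the same four-grapheme string serialised three ways (Python, for illustration; the encodings are just ones I picked as examples).

```python
text = "no\u00ebl"  # precomposed e-diaeresis, 4 code points

as_utf8 = text.encode("utf-8")
as_utf16 = text.encode("utf-16-le")
as_latin1 = text.encode("latin-1")

print(as_utf8)    # b'no\xc3\xabl'
print(as_utf16)   # b'n\x00o\x00\xeb\x00l\x00'
print(as_latin1)  # b'no\xebl'
```

Three different byte strings of 5, 8 and 4 bytes, all of them a legitimate answer to "convert this to a string of bytes". An implicit conversion has to pick one, and the caller may never find out which.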

If I know that my database is expecting UTF-8, I probably want to say $string->getByteString('UTF-8'). I may also want to say $string->getByteStringWithMaxLength('UTF-8', 20) to fit as many whole graphemes as possible into a 20-byte binary field; something that neither $string->substring(0, 20)->getByteString('UTF-8') nor substr( $string->getByteString('UTF-8'), 0, 20 ) can do. The first can produce more than 20 bytes, and the second can cut a multi-byte sequence or a grapheme in half.
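Here is a rough sketch of the byte-limited case, in Python for illustration. The function name is hypothetical, the grapheme splitting is the same simplified "letter plus combining marks" rule, and I have only considered encodings like UTF-8 that do not emit a byte order mark.

```python
import unicodedata

def simple_graphemes(text):
    """Simplified clustering: base character + following combining marks."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

def byte_string_with_max_length(text, encoding, max_bytes):
    """Encode as many whole clusters as fit into max_bytes (hypothetical API)."""
    out = b""
    for cluster in simple_graphemes(text):
        encoded = cluster.encode(encoding)  # assumes a BOM-free encoding
        if len(out) + len(encoded) > max_bytes:
            break
        out += encoded
    return out

text = "e\u0308" * 10  # 10 graphemes, 20 code points, 30 bytes in UTF-8

fitted = byte_string_with_max_length(text, "utf-8", 20)
too_long = text[:20].encode("utf-8")  # first 20 code points
cut_mid = text.encode("utf-8")[:20]   # first 20 bytes

print(len(fitted))    # 18: six whole graphemes
print(len(too_long))  # 30: overflows the field
try:
    cut_mid.decode("utf-8")
except UnicodeDecodeError:
    print("byte cut left an incomplete UTF-8 sequence")
```

Only the first approach gives something that both fits in the field and is still valid text: cutting by code points overflows, and cutting by bytes leaves a dangling lead byte at the end.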

In short, we can only abstract so much - supporting Unicode necessarily means supporting its complexity, not just pretending it's a really big version of ASCII.

[1] http://mortoray.com/2013/11/27/the-string-type-is-broken/

--
Rowan Collins
[IMSoP]


--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
