Re: [PHP-DEV] Unicode support

Rowan Collins Tue, 14 Oct 2014 11:02:15 -0700

On 14/10/2014 14:50, Andrea Faulds wrote:

2. What is currently missing in that regard?

Unicode string support.

I know that was probably deliberately flippant, but I think there is a genuine question to be asked here. A lot of people talk about "Unicode support" like they talk about "XPath support"; but XPath is an API you can adhere to, whereas Unicode is a whole lot more (and less) than that.

What it probably means to most people is "string functions which do what I expect with a vast range of obscure Unicode code point sequences". Those expectations need to be documented *before* an API is written, rather than writing a whole load of functions which use a Unicode library, but don't actually provide the tools that people need.

If you want to see a pragmatic, actually working, work-in-progress attempt at 
better PHP unicode support, see this: https://github.com/krakjoe/ustring

It looks like a good prototype, but glancing at the documentation, I'm not clear exactly what the assumptions of some of the functions are.

There's a lot of talk of "characters", which is a *very* slippery notion in Unicode; charAt() returns a single code point, and $length returns a number of code points. This makes me wonder if it will pass "the noël test" [1] - does a combining diacritic move onto a different letter when you run ->reverse()?
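
To make the reversal question concrete, here is a minimal sketch in plain PHP rather than ustring. It assumes PHP 7.4+ with the mbstring extension and a PCRE build with Unicode support; the helper names reverseCodePoints() and reverseGraphemes() are made up for illustration and do not come from any library.

```php
<?php
// "noël" written with a combining diaeresis: n, o, e, U+0308, l
$noel = "noe\u{0308}l";

// Naive reversal: split into code points and reverse their order.
function reverseCodePoints(string $s): string
{
    return implode('', array_reverse(mb_str_split($s, 1, 'UTF-8')));
}

// Cluster-aware reversal: \X matches one extended grapheme cluster,
// so a base letter and its combining marks travel together.
function reverseGraphemes(string $s): string
{
    preg_match_all('/\X/u', $s, $matches);
    return implode('', array_reverse($matches[0]));
}

$naive = reverseCodePoints($noel); // l, U+0308, e, o, n: the diaeresis lands on the "l"
$aware = reverseGraphemes($noel);  // l, e, U+0308, o, n: the diaeresis stays on the "e"
```

Whether ustring's ->reverse() behaves like the first or the second function is exactly what its documentation would need to state.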

As I've mentioned before, a lot of the time what people actually want to deal with is "grapheme clusters" - the kind of thing that you'd think of as a character if you were writing by hand. Most people, if asked the length of the string "noël", would answer 4, but there may be 5 code points. (That's not just a case of normalisation choices; most combinations of letter+diacritic have no single code point, that's why the combining forms exist.)
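
The three possible "lengths" can be seen side by side with existing PHP functions. This is a sketch assuming PHP 7+ with the mbstring and intl extensions loaded; the variable names are illustrative only.

```php
<?php
$decomposed  = "noe\u{0308}l"; // e followed by U+0308 COMBINING DIAERESIS
$precomposed = "no\u{00EB}l";  // single code point U+00EB

$bytes      = strlen($decomposed);             // 6 bytes in UTF-8
$codePoints = mb_strlen($decomposed, 'UTF-8'); // 5 code points
$graphemes  = grapheme_strlen($decomposed);    // 4 grapheme clusters

// Normalising to NFC composes e + U+0308 into U+00EB ...
$nfc = Normalizer::normalize($decomposed, Normalizer::FORM_C);

// ... but q + U+0308 has no precomposed form, so it stays two
// code points (and one grapheme) even after normalisation.
$q = Normalizer::normalize("q\u{0308}", Normalizer::FORM_C);
$qCodePoints = mb_strlen($q, 'UTF-8');
$qGraphemes  = grapheme_strlen($q);
```

Only the grapheme count matches the answer most people would give, and normalisation alone does not close the gap.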

A good Unicode string API should probably give clear labels and choices for such things - $string->codePointAt(3) is not the same as $string->graphemeAt(3), $string->codePointCount is not the same as $string->graphemeCount, and so forth. A single property $length seems more user-friendly, until the user finds it means something different to what they wanted.
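
A rough sketch of what such explicit labels might look like follows. The class UnicodeText is entirely hypothetical (it is not the ustring API), the counts are methods here rather than properties, and the code assumes PHP 7.4+ with mbstring and intl and UTF-8 input.

```php
<?php
// Hypothetical wrapper for illustration only; not part of ustring or PHP.
final class UnicodeText
{
    private string $utf8;

    public function __construct(string $utf8)
    {
        $this->utf8 = $utf8;
    }

    public function codePointCount(): int
    {
        return mb_strlen($this->utf8, 'UTF-8');
    }

    public function graphemeCount(): int
    {
        return (int) grapheme_strlen($this->utf8);
    }

    // Returns the single code point at a zero-based code point index.
    public function codePointAt(int $index): string
    {
        return mb_substr($this->utf8, $index, 1, 'UTF-8');
    }

    // Returns the whole grapheme cluster at a zero-based cluster index.
    public function graphemeAt(int $index): string
    {
        return (string) grapheme_substr($this->utf8, $index, 1);
    }
}

$text = new UnicodeText("noe\u{0308}l");
$cp = $text->codePointAt(3); // the bare combining diaeresis, U+0308
$gr = $text->graphemeAt(3);  // the letter "l"
```

The same index gives two different answers, which is the case a single $length or charAt() would hide.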

Similarly, an automatic __toString() function is handy, but what encoding does it output, and why? UTF-8? The same encoding that the string was constructed with?

If I know that my database is expecting UTF-8, I probably want to say $string->getByteString('UTF-8'). I may also want to say $string->getByteStringWithMaxLength('UTF-8', 20) to fit an exact number of graphemes into a 20-byte binary space; something that neither $string->substring(0, 20)->getByteString('UTF-8') nor substr($string->getByteString('UTF-8'), 0, 20) can do.
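
The byte-limited case can be sketched as below. The function truncateGraphemesToBytes() is a made-up name standing in for the proposed getByteStringWithMaxLength(); it assumes valid UTF-8 input and PCRE with Unicode support, and mb_substr() stands in for a code-point-based substring().

```php
<?php
// Hypothetical helper: keep as many whole grapheme clusters as fit
// into $maxBytes bytes of UTF-8, never splitting a cluster.
function truncateGraphemesToBytes(string $utf8, int $maxBytes): string
{
    preg_match_all('/\X/u', $utf8, $matches);
    $out = '';
    foreach ($matches[0] as $grapheme) {
        if (strlen($out) + strlen($grapheme) > $maxBytes) {
            break;
        }
        $out .= $grapheme;
    }
    return $out;
}

// 10 graphemes, each "e" + U+0308: 20 code points, 30 bytes.
$input = str_repeat("e\u{0308}", 10);

$fitted       = truncateGraphemesToBytes($input, 20); // 6 whole graphemes, 18 bytes
$byCodePoints = mb_substr($input, 0, 20, 'UTF-8');    // 20 code points, but 30 bytes
$byBytes      = substr($input, 0, 20);                // 20 bytes, cut inside U+0308
```

Truncating by code points overflows the field, and truncating by bytes leaves invalid UTF-8 at the end. The intl extension's grapheme_extract() with GRAPHEME_EXTR_MAXBYTES appears to cover similar ground and may be worth checking before writing a helper like this.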

In short, we can only abstract so much - supporting Unicode automatically means supporting its complexity, not just pretending it's a really big version of ASCII.


[1] http://mortoray.com/2013/11/27/the-string-type-is-broken/

--
Rowan Collins
[IMSoP]


