On 14/10/2014 14:50, Andrea Faulds wrote:
>> 2. What is currently missing in that regard?
> Unicode string support.
I know that was probably deliberately flippant, but I think there is a
genuine question to be asked here. A lot of people talk about "Unicode
support" the way they talk about "XPath support", but XPath is a
specification you can conform to, whereas Unicode is a whole lot more
(and less) than that.
What it probably means to most people is "string functions which do what
I expect with a vast range of obscure Unicode code point sequences".
Those expectations need to be documented *before* an API is written;
otherwise we end up with a whole load of functions which use a Unicode
library but don't actually provide the tools that people need.
> If you want to see a pragmatic, actually working, work-in-progress attempt at
> better PHP unicode support, see this: https://github.com/krakjoe/ustring
It looks like a good prototype, but glancing at the documentation, I'm
not clear exactly what the assumptions of some of the functions are.
There's a lot of talk of "characters", which is a *very* slippery notion
in Unicode; charAt() returns a single code point, and $length returns a
number of code points. This makes me wonder if it will pass "the noël
test" [1] - does a combining diacritic move onto a different letter when
you run ->reverse()?
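
To make the question concrete, here is a minimal sketch in Python (used
only because it is short; this is not the ustring API). The graphemes()
helper is my own simplification: it glues combining marks onto the
preceding code point and ignores the rest of the Unicode segmentation
rules (ZWJ sequences, Hangul, and so on).

```python
import unicodedata

def graphemes(text):
    """Approximate grapheme clusters: a base code point plus following combining marks."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch      # attach the mark to the previous cluster
        else:
            clusters.append(ch)     # start a new cluster
    return clusters

word = "noe\u0308l"                 # "noël" spelled with a combining diaeresis (U+0308)

naive = word[::-1]                  # reverses code points: the diaeresis now follows "l"
careful = "".join(reversed(graphemes(word)))  # reverses clusters: it stays on the "e"
```

A reverse() that works on code points gives the first result; one that
passes the noël test gives the second.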
As I've mentioned before, a lot of the time what people actually want to
deal with is "grapheme clusters" - the kind of thing that you'd think of
as a character if you were writing by hand. Most people, if asked the
length of the string "noël", would answer 4, but there may be 5 code
points. (That's not just a case of normalisation choices: most
combinations of letter+diacritic have no single code point, which is why
the combining forms exist.)
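
A small Python illustration of that parenthetical, using only the
standard unicodedata module. The choice of "q" plus diaeresis as the
example with no precomposed form is mine.

```python
import unicodedata

decomposed = "noe\u0308l"                       # e + combining diaeresis
composed = unicodedata.normalize("NFC", decomposed)

decomposed_count = len(decomposed)              # counts code points, not graphemes
composed_count = len(composed)                  # NFC folds e + U+0308 into U+00EB

# "q" with a diaeresis has no precomposed code point, so NFC cannot shrink it.
q_umlaut = "q\u0308"
q_normalised = unicodedata.normalize("NFC", q_umlaut)
```

Normalising makes "noël" come out as 4 code points, but the q example
still counts as 2 even though a reader would call it one character.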
A good Unicode string API should probably give clear labels and choices
for such things - $string->codePointAt(3) is not the same as
$string->graphemeAt(3), $string->codePointCount is not the same as
$string->graphemeCount, and so forth. A single property $length seems
more user-friendly, until the user finds it means something different to
what they wanted.
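
As a rough sketch of what such labelling could look like (in Python, with
a hypothetical class and hypothetical method names that mirror the ones
above; the grapheme splitting is a simplification that only handles
combining marks):

```python
import unicodedata

class LabelledString:
    """Hypothetical wrapper that makes the caller choose a unit explicitly."""

    def __init__(self, text):
        self._text = text
        self._clusters = []
        for ch in text:
            if self._clusters and unicodedata.combining(ch):
                self._clusters[-1] += ch    # combining mark joins previous cluster
            else:
                self._clusters.append(ch)

    @property
    def code_point_count(self):
        return len(self._text)

    @property
    def grapheme_count(self):
        return len(self._clusters)

    def code_point_at(self, index):
        return self._text[index]

    def grapheme_at(self, index):
        return self._clusters[index]

s = LabelledString("noe\u0308l")
# s.code_point_at(3) is the bare combining diaeresis; s.grapheme_at(3) is "l".
```

There is deliberately no plain length here, so the caller cannot get a
count without saying what is being counted.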
Similarly, an automatic __toString() function is handy, but what
encoding does it output, and why? UTF-8? The same encoding that the
string was constructed with?
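
The reason the choice matters is that the same text is a different byte
sequence in each encoding. A quick Python illustration (the three
encodings are just examples I picked):

```python
text = "no\u00ebl"                      # "noël" with the precomposed ë (U+00EB)

as_utf8 = text.encode("utf-8")          # ë becomes two bytes: C3 AB
as_latin1 = text.encode("latin-1")      # ë becomes one byte: EB
as_utf16 = text.encode("utf-16-le")     # every code point here takes two bytes
```

An implicit conversion has to pick one of these silently, and whichever
it picks will be wrong for somebody.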
If I know that my database is expecting UTF-8, I probably want to say
$string->getByteString('UTF-8'). I may also want to say
$string->getByteStringWithMaxLength('UTF-8', 20) to fit as many whole
graphemes as possible into a 20-byte binary field. Neither
$string->substring(0, 20)->getByteString('UTF-8') nor
substr($string->getByteString('UTF-8'), 0, 20) can do that: the first
counts code points (or graphemes) rather than bytes, so the result may
be longer than 20 bytes, and the second counts bytes, so it may cut a
code point or a grapheme in half.
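
A minimal Python sketch of the byte-limited case, under a few
assumptions: encode_with_max_bytes() is a hypothetical stand-in for the
method named above, graphemes() only handles combining marks, and the
encoding is stateless with no byte order mark (such as UTF-8).

```python
import unicodedata

def graphemes(text):
    """Approximate grapheme clusters: a base code point plus following combining marks."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

def encode_with_max_bytes(text, encoding, max_bytes):
    """Encode as many whole clusters as fit in max_bytes; never split one."""
    out = b""
    for cluster in graphemes(text):
        piece = cluster.encode(encoding)
        if len(out) + len(piece) > max_bytes:
            break
        out += piece
    return out

text = "e\u0308" * 10                   # ten clusters, 20 code points, 30 UTF-8 bytes

fitted = encode_with_max_bytes(text, "utf-8", 20)   # six whole clusters, 18 bytes
by_code_points = text[:20].encode("utf-8")          # 30 bytes: overflows the field
by_bytes = text.encode("utf-8")[:20]                # 20 bytes, but ends mid code point
```

Only the first result both fits in the field and decodes cleanly;
by_bytes ends with half of a combining diaeresis.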
In short, we can only abstract so much - supporting Unicode
automatically means supporting its complexity, not just pretending it's
a really big version of ASCII.
[1] http://mortoray.com/2013/11/27/the-string-type-is-broken/
--
Rowan Collins
[IMSoP]
--
PHP Internals - PHP Runtime Development Mailing List