Re: [PHP-DEV] [Discussion] Scalar Object Strings and Multibyte Encodings

Rowan Collins Thu, 20 Jun 2019 14:20:19 -0700

On 20/06/2019 16:36, Mark Randall wrote:

> "Hello".substr(1) // would work as expected regardless of encoding

As I always point out when "multi-byte support" or "Unicode support" is discussed, it's often ambiguous just what should be "expected".

A lot of systems go from "each character is one byte" to "each character is one code point", but that leads to what I call "the noël problem": if you reverse the string "noël", the expected behaviour is probably for the diaeresis to stay on the "e". However, if it is encoded as a combining diacritic, a code point based implementation will place it onto the "l" instead. Similarly, taking the first three "characters" should give "noë" not "noe". Enforcing normalisation helps in this case, because there is a composed form of e+diaeresis, but that's not true for all combinations ("graphemes") you can encode, or for all operations.
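
As a rough illustration in today's PHP, here is a sketch of the difference between working in code points and working in grapheme clusters. It assumes the mbstring and intl extensions are loaded; the helper names reverseCodePoints() and reverseGraphemes() are made up for this example and are not part of any proposal.

```php
<?php
// Sketch only: assumes ext/mbstring and ext/intl are available.

// "noël" with the diaeresis as a combining mark: n, o, e, U+0308, l
$decomposed = "noe\u{0308}l";

// Hypothetical helper: reverse the string one code point at a time.
function reverseCodePoints(string $s): string
{
    $codePoints = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
    return implode('', array_reverse($codePoints));
}

// Hypothetical helper: reverse the string one grapheme cluster at a time.
// In PCRE, \X matches a single extended grapheme cluster.
function reverseGraphemes(string $s): string
{
    preg_match_all('/\X/u', $s, $matches);
    return implode('', array_reverse($matches[0]));
}

// Code point reversal: l, U+0308, e, o, n (the diaeresis now follows "l")
$byCodePoint = reverseCodePoints($decomposed);

// Grapheme reversal: l, e, U+0308, o, n (the diaeresis stays with "e")
$byGrapheme = reverseGraphemes($decomposed);

// First three "characters": "noe" by code point, "noe" + U+0308 by grapheme
$firstThreeCodePoints = mb_substr($decomposed, 0, 3, 'UTF-8');
$firstThreeGraphemes  = grapheme_substr($decomposed, 0, 3);

// Normalising to NFC composes e + U+0308 into the single code point U+00EB,
// so code point operations happen to give the expected answer here.
$composed           = Normalizer::normalize($decomposed, Normalizer::FORM_C);
$firstThreeComposed = mb_substr($composed, 0, 3, 'UTF-8'); // "no" + U+00EB
```

The normalised case only works because a composed "ë" exists; a grapheme with no composed form would still be split by mb_substr().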

Another example is "length"; what practical purpose does "number of code points" serve, when some of those code points may be combiners or non-printing marks? Often, number of bytes (in some encoding, such as UTF-8) is actually the relevant measure; other times, "width on screen" is what is actually required, and very hard to compute.
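
To show how far apart these measures can be, here is a small sketch that reports several "lengths" for the same string. It assumes the mbstring and intl extensions are loaded; measure() is a hypothetical helper, and mb_strwidth() is used only as a rough stand-in for display width, not as a real answer to "width on screen".

```php
<?php
// Sketch only: assumes ext/mbstring and ext/intl are available.

// Hypothetical helper: several different answers to "how long is this string?"
function measure(string $s): array
{
    return [
        'bytes'       => strlen($s),               // storage size in UTF-8
        'code_points' => mb_strlen($s, 'UTF-8'),   // Unicode code points
        'graphemes'   => grapheme_strlen($s),      // user-perceived characters
        'width_guess' => mb_strwidth($s, 'UTF-8'), // rough column estimate
    ];
}

// n, o, e, U+0308 (combining diaeresis), l
$noel = measure("noe\u{0308}l");
// bytes: 6, code_points: 5, graphemes: 4

// Three wide CJK characters, three bytes each in UTF-8
$cjk = measure("日本語");
// bytes: 9, code_points: 3, graphemes: 3, width_guess: 6
```

Which of these numbers is "the length" depends on whether the caller is sizing a database column, truncating text for a person to read, or aligning columns in a terminal.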

My point is that any attempt to make the language "do the right thing by default" needs serious thought on what "the right thing" is.


Regards,

--
Rowan Collins
[IMSoP]


--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
