On 20/06/2019 16:36, Mark Randall wrote:
"Hello".substr(1) // would work as expected regardless of encoding
As I always point out when "multi-byte support" or "Unicode support" is discussed, it's often ambiguous just what should be "expected".
A lot of systems go from "each character is one byte" to "each character is one code point", but that leads to what I call "the noël problem": if you reverse the string "noël", the expected behaviour is probably for the diaeresis to stay on the "e". However, if it is encoded as a combining diacritic, a code-point-based implementation will place it onto the "l" instead. Similarly, taking the first three "characters" should give "noë", not "noe". Enforcing normalisation helps in this case, because there is a precomposed form of e+diaeresis, but not every combination ("grapheme") you can encode has one, and normalisation does not help for all operations.
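To make the noël problem concrete, here is a minimal sketch in Python (chosen only because its strings are sequences of code points; it is not a proposal for PHP). The `clusters` helper is a hypothetical, deliberately rough approximation that glues combining marks onto the preceding base character; real grapheme segmentation follows the rules in Unicode Standard Annex #29 and handles many more cases.

```python
import unicodedata

# "noël" with the diaeresis encoded as a combining mark (U+0308) after a plain "e"
decomposed = "noe\u0308l"

# Code-point-based operations: the mark moves to the "l", or is cut off
reversed_cp = decomposed[::-1]        # l, U+0308, e, o, n
first_three_cp = decomposed[:3]       # n, o, e (diaeresis lost)

# NFC normalisation composes e + U+0308 into U+00EB, which rescues this case
composed = unicodedata.normalize("NFC", decomposed)
reversed_composed = composed[::-1]    # l, U+00EB, o, n

# ...but "n" + U+0308 has no precomposed form, so it stays two code points
no_precomposed = unicodedata.normalize("NFC", "n\u0308")


def clusters(text):
    """Rough approximation (an assumption, not UAX #29): attach each
    combining mark to the base character that precedes it."""
    out = []
    for ch in text:
        if out and unicodedata.combining(ch):
            out[-1] += ch
        else:
            out.append(ch)
    return out


# Operating on clusters keeps the diaeresis on the "e" without normalising
reversed_clusters = "".join(reversed(clusters(decomposed)))
first_three_clusters = "".join(clusters(decomposed)[:3])
```

Even this helper only covers simple combining marks; it would still split things like emoji joined with zero-width joiners, which is part of why "what is expected" needs defining first.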
Another example is "length"; what practical purpose does "number of code points" serve, when some of those code points may be combiners or non-printing marks? Often, the number of bytes (in some encoding, such as UTF-8) is actually the relevant measure; other times, "width on screen" is what is actually required, which is very hard to compute.
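As a rough illustration of how far these notions of "length" diverge, the following Python sketch reports several of them for the same string. The `measures` function and its key names are made up for this example, and the "wide" count is only a hint towards display width (it counts code points Unicode marks as East Asian Wide or Fullwidth); it is not a real width calculation.

```python
import unicodedata


def measures(text):
    """Return several competing answers to 'how long is this string?'."""
    return {
        # What a code-point-based length function would report
        "code_points": len(text),
        # What matters for storage or a byte-limited field, assuming UTF-8
        "utf8_bytes": len(text.encode("utf-8")),
        # Code points that are not combining marks (closer to "visible" characters)
        "non_combining": sum(1 for ch in text if not unicodedata.combining(ch)),
        # Code points that typically occupy two columns in a terminal
        "wide": sum(1 for ch in text if unicodedata.east_asian_width(ch) in ("W", "F")),
    }


decomposed = measures("noe\u0308l")   # "noël" with a combining diaeresis
composed = measures("no\u00ebl")      # "noël" with precomposed U+00EB
japanese = measures("日本語")
```

The two encodings of "noël" look identical on screen, yet differ in both code points (5 versus 4) and UTF-8 bytes (6 versus 5), while "日本語" is 3 code points, 9 bytes, and usually 6 columns wide.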
My point is that any attempt to make the language "do the right thing by default" needs serious thought on what "the right thing" is.
Regards,

-- 
Rowan Collins
[IMSoP]

-- 
PHP Internals - PHP Runtime Development Mailing List