On 11/02/2022 06:26, Michał wrote:
Hi everyone.
It's a known fact that nowadays most websites use at least UTF-8 encoding. Unfortunately PHP itself has stopped a bit in the previous century. Is there any reason why the mbstring extension cannot be introduced to core in the next major version (maybe preceded with a deprecation message like it was with the mysql extension in v5)? All functions from the standard library would become aliases for multibyte equivalents.


Hi Michal,

If only it were as simple as that...

You might want to read up on the history of PHP 6.0, the version which never happened because the project to introduce native Unicode strings turned out to be so complex and introduced so many performance problems.

There is a hint at part of the complexity in your phrasing "at least UTF-8 encoding" - there isn't really anything that's "more than" UTF-8, but there are certainly other common encodings: Windows-1252 mislabelled as ISO 8859-1 is a common one, and UTF-16 has historically been common on Windows and is a more efficient encoding in some contexts. So having PHP simply assume that all data is in UTF-8 won't work; you will always need to be able to represent a string of bytes and tell PHP to interpret it as some encoding. There are also many contexts (e.g. processing binary files) where interpreting strings as a sequence of bytes (as PHP does now) is absolutely correct. PHP 6.0 would have handled this similarly to Python 3, with "binary strings" and "Unicode strings" as two separate types.
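
To make the bytes-versus-text distinction concrete, here is a rough sketch in Python 3, chosen only because it already has the two separate types mentioned above. The sample strings and variable names are my own illustration, not anything from the PHP 6.0 design:

```python
# The same raw bytes mean different things depending on the encoding
# you tell the runtime to assume.
raw = "café".encode("utf-8")        # 5 bytes: 'é' takes two bytes in UTF-8

as_utf8 = raw.decode("utf-8")       # the intended text
as_cp1252 = raw.decode("cp1252")    # same bytes read as Windows-1252: mojibake

# UTF-16 can be the more compact encoding for some text, e.g. Japanese.
japanese = "日本語"
utf8_size = len(japanese.encode("utf-8"))       # 3 bytes per character here
utf16_size = len(japanese.encode("utf-16-le"))  # 2 bytes per character here

# Binary data is not text at all: forcing an encoding onto it fails.
binary = bytes([0xFF, 0xFE, 0x00])
try:
    binary.decode("utf-8")
    binary_is_valid_utf8 = True
except UnicodeDecodeError:
    binary_is_valid_utf8 = False

print(as_utf8, as_cp1252, utf8_size, utf16_size, binary_is_valid_utf8)
```

The point is only that the bytes on their own don't say which interpretation is right; something outside the string has to supply the encoding, or leave it as bytes.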

There's also, I think, a myth in people's minds that something like "string length" has a single meaning, and that PHP gets it "wrong" for multibyte strings. Actually, the value given by functions like mb_strlen (the number of Unicode code points) is pretty useless. Generally, people are interested either in how many bytes the string will take up (as returned by PHP's strlen) or in how much space it will take up on screen. The latter is a really difficult question, but grapheme_strlen, which counts what you'd think of as "letters", is a better bet than counting code points, which can be individual accents.
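
As a rough illustration of those three different "lengths", here is a Python 3 sketch. The function names are hypothetical, and the last one is a deliberately crude stand-in for what grapheme_strlen does: it only skips combining marks, whereas real grapheme cluster segmentation has many more rules (emoji sequences, Hangul, and so on).

```python
import unicodedata

def byte_length(text, encoding="utf-8"):
    """Bytes needed to store the text; the question PHP's strlen answers."""
    return len(text.encode(encoding))

def code_point_count(text):
    """Number of Unicode code points; the question mb_strlen answers."""
    return len(text)

def approx_letter_count(text):
    """Crude approximation of user-perceived letters.

    Assumption: treating every non-combining code point as one letter.
    This is NOT full grapheme cluster segmentation.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)

# 'e' followed by U+0301 COMBINING ACUTE ACCENT: one visible letter.
s = "e\u0301"

print(byte_length(s), code_point_count(s), approx_letter_count(s))
```

For that one visible letter the three answers come out as 3, 2 and 1 respectively, which is why "the length" depends entirely on what you are going to do with it.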

There probably *are* things PHP could do to improve Unicode handling, but it needs careful thought to avoid making everything worse.

Regards,

--
Rowan Tommins
[IMSoP]

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php
