On 11/02/2022 06:26, Michał wrote:
> Hi everyone.
> It's a known fact that nowadays most websites use at least UTF-8
> encoding. Unfortunately PHP itself has stopped a bit in the previous
> century. Is there any reason why the mbstring extension cannot be
> introduced to core in the next major version (maybe preceded with a
> deprecation message like it was with the mysql extension in v5)? All
> functions from the standard library would become aliases for multibyte
> equivalents.

Hi Michal,
If only it were as simple as that...
You might want to read up on the history of PHP 6.0, the version which
never happened, because the project to introduce native Unicode strings
turned out to be so complex, and introduced so many performance problems.
There is a hint at part of the complexity in your phrasing "at least
UTF-8 encoding" - there isn't really anything that's "more than" UTF-8,
but there are certainly other common encodings - Windows-1252
mislabelled as ISO 8859-1 is a common one; UTF-16 has historically been
common on Windows, and is a more efficient encoding in some contexts. So
having PHP simply assume that all data is in UTF-8 won't work: you will
always need to be able to represent a string of bytes and tell PHP to
interpret it as some encoding. There are also many contexts (e.g.
processing binary files) where interpreting strings as a sequence of
bytes (as PHP does now) is absolutely correct. PHP 6.0 would have
handled this similarly to Python 3, with "binary strings" and "Unicode
strings" as two separate types.
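
To make the "string of bytes plus a declared encoding" point concrete, here
is a minimal sketch in Python 3, since that is the model mentioned above. The
byte value and sample text are just illustrative choices, not anything taken
from the PHP 6.0 design.

```python
# Python 3 keeps raw bytes and decoded text as two distinct types.
raw = b"\x80"  # a single byte with no inherent meaning

# The same byte decodes differently depending on the encoding you declare.
as_cp1252 = raw.decode("cp1252")    # Windows-1252: the euro sign
as_latin1 = raw.decode("latin-1")   # ISO 8859-1: the C1 control U+0080

# As UTF-8 the byte is simply invalid: 0x80 is a lone continuation byte.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# UTF-16 can be more compact than UTF-8 for some text.
text = "日本語"
utf8_size = len(text.encode("utf-8"))       # 3 bytes per character here
utf16_size = len(text.encode("utf-16-le"))  # 2 bytes per character here

# Mixing the two types without an explicit encoding is an error.
try:
    raw + "abc"
    mixed_ok = True
except TypeError:
    mixed_ok = False
```

The byte 0x80 is exactly the kind of case where Windows-1252 mislabelled as
ISO 8859-1 goes wrong: one label gives you a currency symbol, the other an
invisible control character.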
There's also, I think, a myth in people's minds that something like
"string length" has a single meaning, and PHP gets it "wrong" for
multibyte strings; but actually the value given by functions like
mb_strlen (the number of Unicode code points) is pretty useless -
generally, people are actually interested in how many bytes the string
will take up (as returned by PHP's strlen) or how much space it will take
up on screen (a really difficult question, but grapheme_strlen, which
counts what you'd think of as "letters", is a better bet than counting
code points, which can be individual accents).
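
As a rough illustration of those three meanings of "length", here is a small
Python 3 sketch. The helper names are hypothetical, and the last one is only
an approximation: it skips combining marks using the standard unicodedata
module, whereas PHP's grapheme_strlen relies on proper grapheme cluster
rules, so the two will disagree on harder cases such as emoji sequences.

```python
import unicodedata


def byte_length(s: str) -> int:
    """Bytes needed to store s as UTF-8 (what PHP's strlen reports for UTF-8 data)."""
    return len(s.encode("utf-8"))


def code_point_length(s: str) -> int:
    """Number of Unicode code points (comparable to mb_strlen)."""
    return len(s)


def approx_grapheme_length(s: str) -> int:
    """Rough count of "letters": code points that are not combining marks.

    This is an assumption-laden stand-in, not real grapheme segmentation.
    """
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)


# "café" written with a plain "e" followed by a combining acute accent.
word = "cafe\u0301"

lengths = (
    byte_length(word),
    code_point_length(word),
    approx_grapheme_length(word),
)
```

For this input the three counts all differ, and only the last one matches
the four "letters" a reader would see.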
There probably *are* things PHP could do to improve Unicode handling,
but it needs careful thought to avoid making everything worse.
Regards,
--
Rowan Tommins
[IMSoP]
--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php