Re: [PHP-DEV] Multibyte strings

Rowan Tommins Fri, 11 Feb 2022 01:14:24 -0800

On 11/02/2022 06:26, Michał wrote:

Hi everyone.
It's a known fact that nowadays most websites use at least UTF-8 encoding. Unfortunately PHP itself has stopped a bit in the previous century. Is there any reason why the mbstring extension cannot be introduced to core in the next major version (maybe preceded with a deprecation message like it was with the mysql extension in v5)? All functions from the standard library would become aliases for multibyte equivalents.


Hi Michal,

If only it were as simple as that...

You might want to read up on the history of PHP 6.0, the version which never happened because the project to introduce native Unicode strings turned out to be so complex, and introduced so many performance problems.

There is a hint at part of the complexity in your phrasing "at least UTF-8 encoding" - there isn't really anything that's "more than" UTF-8, but there are certainly other common encodings - Windows-1252 mislabelled as ISO 8859-1 is a common one; UTF-16 has historically been common on Windows, and is a more efficient encoding in some contexts. So having PHP simply assume that all data is in UTF-8 won't work: you will always need to be able to represent a string of bytes and tell PHP to interpret it as some encoding. There are also many contexts (e.g. processing binary files) where interpreting strings as a sequence of bytes (as PHP does now) is absolutely correct. PHP 6.0 would have handled this similarly to Python 3, with "binary strings" and "Unicode strings" as two separate types.
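
To make the "bytes plus an encoding" point concrete, here is a minimal sketch in Python 3, chosen only because it already has the two separate types mentioned above. The sample word and variable names are made up for illustration; nothing here reflects how PHP 6.0 was actually designed.

```python
# Four bytes, as they might arrive from a file or a socket.
raw = b"caf\xe9"  # "café" encoded as Windows-1252 (0xE9 is "é")

# Bytes carry no encoding of their own; the caller has to supply one.
as_cp1252 = raw.decode("cp1252")

# Blindly assuming UTF-8 fails: a lone 0xE9 is not valid UTF-8.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# The same text becomes different byte sequences in different encodings.
utf8_bytes = as_cp1252.encode("utf-8")       # 5 bytes
utf16_bytes = as_cp1252.encode("utf-16-le")  # 8 bytes, 2 per code point

print(as_cp1252, utf8_ok, len(raw), len(utf8_bytes), len(utf16_bytes))
```

The text only becomes "a string of characters" once an encoding is named; until then it is just bytes, which is exactly what a binary file needs.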

There's also, I think, a myth in people's minds that something like "string length" has a single meaning, and PHP gets it "wrong" for multibyte strings; but actually the value given by functions like mb_strlen (the number of Unicode code points) is pretty useless - generally, people are actually interested in how many bytes the string will take up (as returned by PHP's strlen) or how much space it will take up on screen (a really difficult question, but grapheme_strlen, which counts what you'd think of as "letters", is a better bet than counting code points, which can be individual accents).
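
A rough sketch of those three "lengths", again in Python 3 for illustration. The helper name `lengths` is hypothetical, and the grapheme count is only an approximation that skips combining marks; it is not the full segmentation that grapheme_strlen performs.

```python
import unicodedata

def lengths(text):
    """Return (UTF-8 bytes, code points, approximate graphemes)."""
    byte_count = len(text.encode("utf-8"))  # what strlen reports for UTF-8 data
    code_points = len(text)                 # what mb_strlen reports
    # Approximation only: count code points that are not combining marks.
    approx_graphemes = sum(
        1 for ch in text if unicodedata.combining(ch) == 0
    )
    return byte_count, code_points, approx_graphemes

decomposed = "e\u0301"  # "e" followed by U+0301 COMBINING ACUTE ACCENT
precomposed = "\u00e9"  # the same letter as a single code point

print(lengths(decomposed))   # (3, 2, 1)
print(lengths(precomposed))  # (2, 1, 1)
```

Both strings display as the same single letter, yet the byte and code point counts differ, so neither of those numbers says much about what ends up on screen.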

There probably *are* things PHP could do to improve Unicode handling, but it needs careful thought to avoid making everything worse.


Regards,

--
Rowan Tommins
[IMSoP]

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php
