Re: [PHP-DEV][Discussion] Should All String Functions Become Multi-Byte Safe?

Larry Garfield Sun, 11 Aug 2024 10:05:59 -0700

On Sun, Aug 11, 2024, at 10:50 AM, Nick Lockheart wrote:
> HTML 5 was adopted in 2014, over ten years ago. HTML 5 only supports
> the UTF-8 multi-byte character encoding.
>
> It seems like there's still a lot of string functions that assume that
> a character is a single byte, and these may actually work as expected
> when dealing with Latin characters, but may fail unexpectedly if a
> sequence is more than one byte.
>
> Are there any use cases for PHP where **single-byte** characters are
> the norm?
>
> It seems that if everything on the Internet is multi-byte encoded now,
> then all of the PHP string functions should be multi-byte safe.
>
>
> The WHATWG Encoding Standard:
>
> https://encoding.spec.whatwg.org/
>
> Also, according to Mozilla, "[The meta charset] attribute declares the
> document's character encoding. If the attribute is present, its value
> must be an ASCII case-insensitive match for the string "utf-8", because
> UTF-8 is the only valid encoding for HTML5 documents."
>
> https://developer.mozilla.org/en-US/docs/Web/HTML/Element/meta#charset


Some background and history, for those not familiar...

After PHP 5.2, there was a huge effort to move PHP to using Unicode internally.
It was to be released as PHP 6.  Unfortunately, it ran into a whole host of
problems, among them:

1. It tried to use UTF-16 internally, as there were good libraries for it, but
it was much, much slower than was acceptable.
2. It required rewriting basically everything.
3. Trying to support two string variants at the same time (because binary
strings are still very useful) in almost the same syntax turned out to be, um,
kinda hard.

After a number of years of work, it was eventually concluded that it was a dead 
end.  So the non-Unicode-related bits of what would have been PHP 6 got renamed 
to PHP 5.3 and released to much fanfare, kicking off the PHP Renaissance Era.

When PHP 5.6+1 was released, there was a vote to decide if it should be called 
6 or 7.  7 won, mainly on the grounds that a number of very stupid book 
publishers had released "PHP 6" books in anticipation of PHP 6's release that 
were now completely useless and misleading.  So we skipped 6 entirely, and PHP 
6-compatibility is a running joke among those who have been around a while.

Fortunately, the vast majority of single-byte strings are ASCII, and ASCII is, 
by design, a strict subset of UTF-8, so in practice the lack of native UTF-8 
strings rarely causes an issue.
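
To make that concrete, here is a minimal sketch (illustration only, not part of
any proposal).  It assumes PHP 7.0+ for the \u{...} escape and that ext/mbstring
is loaded; the variable names are made up for the example.

```php
<?php
// Byte-oriented vs. multi-byte-aware functions on ASCII and non-ASCII input.

$ascii = "hello";
$utf8  = "h\u{E9}llo"; // "héllo"; U+00E9 encodes as two bytes (0xC3 0xA9) in UTF-8

// Pure ASCII: one byte per character, so both functions agree.
$asciiBytes = strlen($ascii);             // 5
$asciiChars = mb_strlen($ascii, 'UTF-8'); // 5

// One multi-byte character: the byte count and character count diverge.
$utf8Bytes = strlen($utf8);               // 6
$utf8Chars = mb_strlen($utf8, 'UTF-8');   // 5

// Byte-oriented slicing can cut a multi-byte sequence in half,
// leaving a string that is no longer valid UTF-8.
$byteSlice = substr($utf8, 0, 2);             // "h" plus a lone 0xC3 byte
$charSlice = mb_substr($utf8, 0, 2, 'UTF-8'); // "hé"

$byteSliceValid = mb_check_encoding($byteSlice, 'UTF-8'); // false
$charSliceValid = mb_check_encoding($charSlice, 'UTF-8'); // true

var_dump($asciiBytes, $asciiChars, $utf8Bytes, $utf8Chars);
var_dump($byteSliceValid, $charSliceValid);
```

As long as the input is ASCII, the plain functions and the mb_* functions give
the same answers, which is why most code never notices the difference.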

Trying to introduce Unicode strings to the language now as a native type
would... probably break just as much, if not more.  If anything, it's probably
harder today than it was in 2008, because both the engine and the body of
existing code that must not break have grown considerably.

A much better approach would be something like this RFC from Derick a few years 
ago:

https://wiki.php.net/rfc/unicode_text_processing

If you need something today, then Symfony has a user-space approximation of it: 

https://symfony.com/doc/current/string.html

--Larry Garfield
