On Sun, Aug 11, 2024, at 10:50 AM, Nick Lockheart wrote:
> HTML 5 was adopted in 2014, over ten years ago. HTML 5 only supports
> the UTF-8 multi-byte character encoding.
>
> It seems like there's still a lot of string functions that assume that
> a character is a single byte, and these may actually work as expected
> when dealing with Latin characters, but may fail unexpectedly if a
> sequence is more than one byte.
>
> Are there any use cases for PHP where **single-byte** characters are
> the norm?
>
> It seems that if everything on the Internet is multi-byte encoded now,
> then all of the PHP string functions should be multi-byte safe.
>
>
> The WHATWG Encoding Standard:
>
> https://encoding.spec.whatwg.org/
>
> Also, according to Mozilla, "[The meta charset] attribute declares the
> document's character encoding. If the attribute is present, its value
> must be an ASCII case-insensitive match for the string "utf-8", because
> UTF-8 is the only valid encoding for HTML5 documents."
>
> https://developer.mozilla.org/en-US/docs/Web/HTML/Element/meta#charset

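For concreteness, here is a minimal sketch of the kind of mismatch being described, assuming the strings are UTF-8. `count_code_points()` is a hypothetical helper written for this example, not a PHP built-in; it relies on PCRE's `/u` modifier. With the mbstring extension loaded, `mb_strlen($s, 'UTF-8')` should give the same count.

```php
<?php
// Byte-oriented string functions count and move bytes, not characters.
$ascii = "cafe";       // pure ASCII: one byte per character
$utf8  = "caf\u{e9}";  // "café": the é is encoded as two bytes (C3 A9) in UTF-8

// Hypothetical helper (not part of PHP): count Unicode code points
// by matching one code point at a time in PCRE's UTF-8 mode.
function count_code_points(string $s): int
{
    return (int) preg_match_all('/./su', $s);
}

echo strlen($ascii), "\n";            // 4: bytes and characters agree for ASCII
echo strlen($utf8), "\n";             // 5: bytes, not characters
echo count_code_points($utf8), "\n";  // 4: code points

// Reversing byte-by-byte splits the two-byte sequence for é,
// so the result is no longer valid UTF-8.
echo bin2hex(strrev($utf8)), "\n";    // a9c3666163
```

Note that the ASCII string behaves identically either way, which is why byte-oriented functions so often appear to work.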
Some background and history, for those not familiar...

After PHP 5.2, there was a huge effort to move PHP to using Unicode internally. It was to be released as PHP 6. Unfortunately, it ran into a whole host of problems, among them:

1. It tried to use UTF-16 internally, as there were good libraries for it, but it was much, much slower than was acceptable.
2. It required rewriting basically everything.
3. Trying to support two string variants at the same time (because binary strings are still very useful) in almost the same syntax turned out to be, um, kinda hard.

After a number of years of work, it was eventually concluded that it was a dead end. So the non-Unicode-related bits of what would have been PHP 6 got renamed to PHP 5.3 and released to much fanfare, kicking off the PHP Renaissance Era.

When PHP 5.6+1 was released, there was a vote to decide if it should be called 6 or 7. 7 won, mainly on the grounds that a number of very stupid book publishers had released "PHP 6" books in anticipation of PHP 6's release that were now completely useless and misleading. So we skipped 6 entirely, and PHP 6-compatibility is a running joke among those who have been around a while.

Fortunately, the vast majority of single-byte strings are ASCII, and ASCII is, by design, a strict subset of UTF-8, so in practice the lack of native UTF-8 strings rarely causes an issue.

Trying to introduce Unicode strings to the language now as a native type would... probably break just as much, if not more. If anything, it's probably harder today than it was in 2008, because the engine and the body of existing code that must not break have both grown considerably.

A much better approach would be something like this RFC from Derick a few years ago:

https://wiki.php.net/rfc/unicode_text_processing

If you need something today, then Symfony has a user-space approximation of it:

https://symfony.com/doc/current/string.html

--Larry Garfield