> Some background and history, for those not familiar...
>
> After PHP 5.2, there was a huge effort to move PHP to using Unicode
> internally. It was to be released as PHP 6. Unfortunately, it ran
> into a whole host of problems, among them:
>
> 1. It tried to use UTF-16 internally, as there were good libraries
>    for it but it was much much slower than was acceptable.
> 2. It required rewriting basically everything.
> 3. Trying to support two string variants at the same time (because
>    binary strings are still very useful) in almost the same syntax
>    turned out be, um, kinda hard.
>
> After a number of years of work, it was eventually concluded that it
> was a dead end. So the non-Unicode-related bits of what would have
> been PHP 6 got renamed to PHP 5.3 and released to much fanfare,
> kicking off the PHP Renaissance Era.
>
> When PHP 5.6+1 was released, there was a vote to decide if it should
> be called 6 or 7. 7 won, mainly on the grounds that a number of very
> stupid book publishers had released "PHP 6" books in anticipation of
> PHP 6's release that were now completely useless and misleading. So
> we skipped 6 entirely, and PHP 6-compatibility is a running joke
> among those who have been around a while.
>
> Fortunately, the vast majority of single-byte strings are ASCII, and
> ASCII is, by design, a strict subset of UTF-8, so in practice the
> lack of native UTF-8 strings rarely causes an issue.
>
> Trying to introduce Unicode strings to the language now as a native
> type would... probably break just as much if not more. If anything
> it's probably harder today than it was in 2008, because the engine
> and existing code to not-break has grown considerably.
>
> A much better approach would be something like this RFC from Derick a
> few years ago:
>
> https://wiki.php.net/rfc/unicode_text_processing
>
> If you need something today, then Symfony has a user-space
> approximation of it:
>
> https://symfony.com/doc/current/string.html
>
> --Larry Garfield

I think that when people think of "strings", they think of human-readable text. I wasn't suggesting that Unicode strings be a native type, but rather that functions that have "string" in the name should be UTF-8 safe. There are a lot of pitfalls here, and I don't think the documentation clearly calls out which functions are OK to use with UTF-8 and which ones may cause surprises.

The compatibility between ASCII and UTF-8 for basic Latin characters is both a curse and a blessing. An application may work fine in testing, but then break when a user submits an emoji.

It seems like it would be good to have a set of functions for each intended use case, each behaving in accordance with that use: math and number functions for calculations, string functions for human-readable text (which are UTF-8 safe), and byte functions for binary processing (which are binary safe). Right now, using a function for a given use case requires knowing its internals, whereas developers should be able to rely on the name to know that it will work for that use case.

For many functions, the manual doesn't specify whether they are safe for multi-byte characters. `ltrim` doesn't mention multi-byte at all:

https://www.php.net/manual/en/function.ltrim.php

The `trim` page doesn't mention it either, except for a user-contributed note at the bottom:

"Note that trim() is not aware of Unicode points that represent whitespace (e.g., in the General Punctuation block), except, of course, for the ones mentioned in this page. There is no Unicode-specific trim function in PHP at the time of writing (July 2023), but you can try some examples of trims using multibyte strings posted on the comments for the mbstring extension: https://www.php.net/manual/en/ref.mbstring.php".

So what I would propose is:

(1) All string functions should state in the official manual page whether they are safe for UTF-8 or not.
(2) Functions intended for working with text should be made UTF-8 safe.
(3) Functions intended for processing binary data should be added if necessary, and should have names containing something like "binary" or "byte".
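
To make the pitfalls concrete, here is a minimal sketch of the kind of surprise I mean. It assumes the mbstring extension is loaded and that the input is UTF-8. The `unicode_trim()` helper is a hypothetical name of my own, not a PHP built-in, and is only one possible way to approximate a Unicode-aware trim in user space.

```php
<?php
// Sample text: "naïve" plus an emoji. The ï (U+00EF) takes 2 bytes in UTF-8
// and the emoji (U+1F600) takes 4, so byte count and character count differ.
$text = "na\u{00EF}ve \u{1F600}";

$byteLength = strlen($text);              // counts bytes: 11
$charLength = mb_strlen($text, 'UTF-8');  // counts code points: 7

// substr() counts bytes, so a cut can land in the middle of a character.
$byteCut = substr($text, 0, 3);                          // "na" plus half of "ï"
$byteCutIsValid = mb_check_encoding($byteCut, 'UTF-8');  // false
$charCut = mb_substr($text, 0, 3, 'UTF-8');              // "naï"

// trim() only strips the ASCII whitespace bytes in its default list, so an
// EM SPACE (U+2003) on either side survives.
$padded = "\u{2003}hello\u{2003}";
$asciiTrimmed = trim($padded);            // unchanged

// Hypothetical helper (an assumption for illustration, not part of PHP):
// strips Unicode separators as well as ordinary whitespace from both ends.
function unicode_trim(string $s): string
{
    // \p{Z} matches Unicode separator characters, \s covers tabs and
    // newlines, and the /u modifier makes PCRE treat the subject as UTF-8.
    // preg_replace() returns null on error (e.g. invalid UTF-8), in which
    // case the input is returned untouched.
    return preg_replace('/^[\p{Z}\s]+|[\p{Z}\s]+$/u', '', $s) ?? $s;
}

$unicodeTrimmed = unicode_trim($padded);  // "hello"
```

Nothing in the names `strlen`, `substr`, or `trim` tells you they are byte-oriented, which is exactly the documentation gap described in point (1).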