> On Aug 12, 2024, at 4:25 PM, Rowan Tommins [IMSoP] <imsop....@rwec.co.uk> wrote:
>
> On 12/08/2024 17:37, Mike Schinkel wrote:
>> A really standout paragraph from that link is:
>>
>> "IMO, the whole situation is a shame. Unicode should be
>> in the stdlib of every language by default. It’s the lingua
>> franca of the internet! It’s not even new: we’ve been living
>> with Unicode for 20 years now."
>
> I actually think that paragraph rather ignores everything else the article
> has just explained.

You and I had different takeaways then.

> and it's not entirely clear what it should even mean.

I cannot speak for the author of the article, but I thought I had implied strongly enough what it would mean to me. Evidently I did not, so I will be explicit:

Pursue this RFC: https://wiki.php.net/rfc/unicode_text_processing

> The main reason it's not *mandatory* for all builds of PHP, just "bundled",
> is that the sheer complexity of Unicode means that the library is rather
> large

Let me see if I understand your argument correctly. You are asserting that Unicode is "too complex" to be handled in the standard library, so that complexity should instead be shouldered individually by each and every PHP developer who needs to work with Unicode text in PHP, which is many PHP developers if not eventually most. Is that your argument?

Imagine if PHP had taken the position that "It is too complex, so we'll just make userland developers deal with it" regarding cryptography and encryption? Or regular expressions? Or image processing? Or time and date manipulation? Or network and socket programming?

> "Putting Unicode in the stdlib" is an incredibly difficult task, and it's not
> entirely clear what it should even mean.
> ...
> somebody (Rasmus, I think?) joked that relying on it for PHP 6 would have
> made PHP a small library attached to the side of ICU.

You are comparing apples and oranges. Putting Unicode into an existing *language* and integrating it with built-in data types in a backward-compatible manner is a MUCH bigger lift than "putting Unicode into a standard library." The latter is just providing functions and/or an object and methods for the majority of tasks needed to process Unicode text.

PHP already has some functions for Unicode in the standard library, as has been mentioned, but not enough to reasonably handle most Unicode text-related tasks.
A Unicode text processing class with the existing RFC as a starting point could unify that functionality and fill in the gaps.

BTW, I have done a significant amount of work with Unicode in Go (which handles code points natively, but unfortunately not graphemes), and handling Unicode effectively is not *that* hard. The rules are many, but they are straightforward. Certainly it is not harder than cryptography and encryption, which PHP addresses in core.

> We also have the "mbstring" extension, which was *not* designed around
> Unicode, but was originally built for various encodings popular in Japan 20+
> years ago. It doesn't have the databases of codepoint information that ICU
> does, so can't answer questions like "what script does this code point belong
> to?" or "what is the uppercase equivalent of this grapheme, assuming a
> Turkish locale?"

Interesting historical factoid, but how is that really relevant to including Unicode in the standard library?

-Mike