Re: [PHP-DEV][Discussion] Should All String Functions Become Multi-Byte Safe?

Mike Schinkel Fri, 16 Aug 2024 11:50:57 -0700

> On Aug 12, 2024, at 4:25 PM, Rowan Tommins [IMSoP] <imsop....@rwec.co.uk> 
> wrote:
> 
> On 12/08/2024 17:37, Mike Schinkel wrote:
>> A really standout paragraph from that link is:
>> 
>> "IMO, the whole situation is a shame. Unicode should be
>> in the stdlib of every language by default. It’s the lingua
>> franca of the internet! It’s not even new: we’ve been living
>> with Unicode for 20 years now."
> 
> I actually think that paragraph rather ignores everything else the article 
> has just explained.


You and I had different takeaways then.

> and it's not entirely clear what it should even mean.

I cannot speak for the author of the article, but I thought I had implied 
strongly enough what it would mean to me. Evidently I did not, so I will be 
explicit:

Pursue this RFC: https://wiki.php.net/rfc/unicode_text_processing

> The main reason it's not *mandatory* for all builds of PHP, just "bundled", 
> is that the sheer complexity of Unicode means that the library is rather 
> large 

Let me see if I understand your argument correctly. You are asserting that 
Unicode is "too complex" to be handled in the standard library so that 
complexity should instead be shouldered individually by each and every PHP 
developer who needs to work with Unicode text in PHP, which is many PHP 
developers if not eventually most. Is that your argument?

Imagine if PHP had taken the position that "It is too complex, so we'll just 
make userland developers deal with it" regarding cryptography and encryption? 
Or regular expressions?  Or image processing?  Or time and date manipulation? 
Or network and socket programming?

> "Putting Unicode in the stdlib" is an incredibly difficult task, and it's not 
> entirely clear what it should even mean.
> ...
> somebody (Rasmus, I think?) joked that relying on it for PHP 6 would have 
> made PHP a small library attached to the side of ICU.

You are comparing apples and oranges. 

Putting Unicode into an existing *language* and integrating with built-in data 
types in a backward compatible manner is a MUCH bigger lift than "putting 
Unicode into a standard library." The latter is just providing functions and/or 
an object and methods for the majority of tasks needed to process Unicode text. 

PHP already has some functions for Unicode in the standard library, as has been 
mentioned, but not enough to reasonably handle most Unicode text-related tasks. 
A Unicode text processing class with the existing RFC as a starting point could 
unify that functionality and fill in the gaps.

BTW, I have done a significant amount of work with Unicode in Go (which 
handles code points natively, but unfortunately not graphemes), and handling 
Unicode effectively is not *that* hard. The rules are many, but they are 
straightforward. Certainly it is not harder than cryptography and encryption, 
which PHP addresses in core.
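
To make the code point versus grapheme distinction concrete, here is a minimal 
Go sketch. The helper name `counts` is hypothetical and exists only for this 
illustration. It uses only the standard library, which can count bytes and 
code points but offers no grapheme segmentation.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// counts is a hypothetical helper for this illustration. It returns the
// UTF-8 byte length of s and the number of Unicode code points in s.
func counts(s string) (byteLen, codePoints int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	// "e" followed by U+0301 COMBINING ACUTE ACCENT. A reader sees a single
	// character (one grapheme), but it is stored as two code points.
	s := "e\u0301"

	b, cp := counts(s)
	fmt.Println("bytes:", b)        // 3
	fmt.Println("code points:", cp) // 2
	// The standard library has no call that reports the grapheme count (1).
}
```

Counting what a user perceives as one character is the kind of gap a Unicode 
text processing class in the standard library would be expected to close.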

> We also have the "mbstring" extension, which was *not* designed around 
> Unicode, but was originally built for various encodings popular in Japan 20+ 
> years ago. It doesn't have the databases of codepoint information that ICU 
> does, so can't answer questions like "what script does this code point belong 
> to?" or "what is the uppercase equivalent of this grapheme, assuming a 
> Turkish locale?"

Interesting historical factoid, but how is that really relevant to including 
Unicode in the standard library?

-Mike
