Re: [PHP-DEV][Discussion] Should All String Functions Become Multi-Byte Safe?

Rowan Tommins [IMSoP] Mon, 12 Aug 2024 13:28:41 -0700

On 12/08/2024 17:37, Mike Schinkel wrote:

A really standout paragraph from that link is:


"IMO, the whole situation is a shame. Unicode should be
in the stdlib of every language by default. It’s the lingua
franca of the internet! It’s not even new: we’ve been living
with Unicode for 20 years now."

I actually think that paragraph rather ignores everything else the article has just explained. "Putting Unicode in the stdlib" is an incredibly difficult task, and it's not entirely clear what it should even mean.

In PHP, we have ext/intl, built around a library called ICU, developed by the Unicode consortium. Unfortunately, it only exposes a small selection of ICU's functions, e.g. there's nothing for locale-based case folding of whole strings. The ext/intl documentation is also very patchy, and the actual ICU documentation isn't always much better.

The main reason it's not *mandatory* for all builds of PHP, just "bundled", is that the sheer complexity of Unicode means that the library is rather large - somebody (Rasmus, I think?) joked that relying on it for PHP 6 would have made PHP a small library attached to the side of ICU.

We also have the "mbstring" extension, which was *not* designed around Unicode, but was originally built for various encodings popular in Japan 20+ years ago. It doesn't have the databases of codepoint information that ICU does, so it can't answer questions like "what script does this code point belong to?" or "what is the uppercase equivalent of this grapheme, assuming a Turkish locale?"
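
As a rough illustration of that gap, here is a minimal sketch, assuming a build with both ext/intl and ext/mbstring loaded. The function names scriptOf() and mbUpper() are made up for this example; the IntlChar and mb_strtoupper() calls are the real extension APIs. It shows ICU's character database answering the "what script?" question, and mbstring's upper-casing having no way to be told about a locale.

```php
<?php
// Sketch only: assumes ext/intl and ext/mbstring are both available.

// Hypothetical helper: ask ICU's character database which script a
// single code point belongs to. mbstring has no equivalent lookup.
function scriptOf(string $char): string
{
    $script = IntlChar::getIntPropertyValue($char, IntlChar::PROPERTY_SCRIPT);

    // Translate ICU's numeric script code into its long property name.
    return IntlChar::getPropertyValueName(IntlChar::PROPERTY_SCRIPT, $script);
}

// Hypothetical helper: mbstring upper-casing. Note that there is no
// locale parameter, so the result is the same whatever language the
// text is actually in.
function mbUpper(string $text): string
{
    return mb_strtoupper($text, 'UTF-8');
}

echo scriptOf("\u{0416}"), "\n"; // U+0416 CYRILLIC CAPITAL LETTER ZHE
echo mbUpper('i'), "\n";         // plain "I"; Turkish would want U+0130
```

In Turkish, the upper-case form of "i" is the dotted capital "İ" (U+0130), which is exactly the kind of answer that needs locale data as well as code point data.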


--
Rowan Tommins
[IMSoP]
