On 12/08/2024 17:37, Mike Schinkel wrote:
> A really standout paragraph from that link is:
>
> "IMO, the whole situation is a shame. Unicode should be
> in the stdlib of every language by default. It’s the lingua
> franca of the internet! It’s not even new: we’ve been living
> with Unicode for 20 years now."


I actually think that paragraph rather ignores everything else the article has just explained. "Putting Unicode in the stdlib" is an incredibly difficult task, and it's not entirely clear what it should even mean.

In PHP, we have ext/intl, built around a library called ICU, developed by the Unicode Consortium. Unfortunately, it only exposes a small selection of ICU's functions; for example, there's no dedicated function for locale-aware case conversion of whole strings. The ext/intl documentation is also very patchy, and the actual ICU documentation isn't always much better.

The main reason it's not *mandatory* for all builds of PHP, just "bundled", is that the sheer complexity of Unicode means that the library is rather large. Somebody (Rasmus, I think?) joked that relying on it for PHP 6 would have made PHP a small library attached to the side of ICU.

We also have the "mbstring" extension, which was *not* designed around Unicode, but was originally built for various encodings popular in Japan 20+ years ago. It doesn't have the databases of code point information that ICU does, so it can't answer questions like "what script does this code point belong to?" or "what is the uppercase equivalent of this grapheme, assuming a Turkish locale?"
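
The Turkish question is easy to demonstrate in any language whose built-in uppercase function ignores locale. Here is a minimal sketch in Python, chosen only because its `str.upper()` takes no locale; it says nothing about how ext/intl or mbstring work internally. The helper name `turkish_upper` is hypothetical, and it only handles the dotted/dotless "i" pair, not the full set of locale-specific casing rules.

```python
# Sketch only: a locale-independent uppercase versus a hand-rolled Turkish rule.
# `turkish_upper` is a hypothetical helper, not part of any library.

DOTTED_CAPITAL_I = "\u0130"  # LATIN CAPITAL LETTER I WITH DOT ABOVE


def turkish_upper(text: str) -> str:
    """Uppercase `text`, mapping 'i' to U+0130 as Turkish orthography expects."""
    # Handle the one locale-specific pair first; dotless U+0131 already
    # uppercases to a plain 'I' under the default mapping.
    return text.replace("i", DOTTED_CAPITAL_I).upper()


default_result = "istanbul".upper()          # default, locale-independent mapping
turkish_result = turkish_upper("istanbul")   # with the Turkish rule applied
```

The default mapping produces "ISTANBUL", where a Turkish reader would expect "İSTANBUL". Getting that right in general needs locale data of the kind ICU ships, not a one-line special case like this.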

--
Rowan Tommins
[IMSoP]
