On 12/08/2024 17:37, Mike Schinkel wrote:
> A really standout paragraph from that link is:
>
> "IMO, the whole situation is a shame. Unicode should be
> in the stdlib of every language by default. It’s the lingua
> franca of the internet! It’s not even new: we’ve been living
> with Unicode for 20 years now."

I actually think that paragraph rather ignores everything else the
article has just explained. "Putting Unicode in the stdlib" is an
incredibly difficult task, and it's not entirely clear what it should
even mean.

In PHP, we have ext/intl, built around a library called ICU, which is
now maintained by the Unicode Consortium. Unfortunately, ext/intl only
exposes a small selection of ICU's functions, e.g. there's nothing for
locale-based case conversion of whole strings. The ext/intl
documentation is also very patchy, and the actual ICU documentation
isn't always much better.

The main reason ext/intl is only "bundled", rather than *mandatory* for
all builds of PHP, is that the sheer complexity of Unicode makes ICU
rather large - somebody (Rasmus, I think?) joked that relying on it for
PHP 6 would have made PHP a small library attached to the side of ICU.

We also have the "mbstring" extension, which was *not* designed around
Unicode, but was originally built for various encodings popular in Japan
20+ years ago. It doesn't have the databases of code point information
that ICU does, so it can't answer questions like "what script does this
code point belong to?" or "what is the uppercase equivalent of this
grapheme, assuming a Turkish locale?"
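
To make the Turkish example concrete, here is a minimal sketch, in
Python rather than PHP, purely as an illustration. Python's built-in
str.upper() takes no locale, so it shows the same kind of gap. The
TURKISH_UPPER table and the upper_turkish() helper are hypothetical
names made up for this example, and they cover only the dotted "i",
which is nowhere near a full set of locale rules.

```python
# Locale-independent uppercasing: the same mapping for every language.
default_upper = "i".upper()  # "I" (U+0049)

# Hypothetical, hand-written exception table for Turkish (assumption:
# only the dotted lowercase i is handled; real locale data has more rules).
TURKISH_UPPER = {"i": "\u0130"}  # U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE


def upper_turkish(text: str) -> str:
    """Uppercase text, applying the Turkish dotted-i exception first."""
    return "".join(TURKISH_UPPER.get(ch, ch.upper()) for ch in text)


turkish_upper = upper_turkish("istanbul")  # "\u0130STANBUL"
```

A real implementation would take these rules from ICU's locale data
rather than a hand-written table, which is exactly the kind of database
described above.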
--
Rowan Tommins
[IMSoP]