On Fri, Feb 11, 2022 at 3:14 AM Rowan Tommins <rowan.coll...@gmail.com>
wrote:

> There's also I think a myth in people's minds that something like
> "string length" has a single meaning, and PHP gets it "wrong" for
> multibyte strings;
>

This++.

Unicode is not a static standard definition of all characters.  New emoji
are added to the specification regularly, and while a glyph like 👪 might
look like a single "character" to a set of human eyes, and indeed as of
Unicode 6.0 is a single codepoint (U+1F46A), it is also (and still, FTR)
expressible using Zero Width Joiners as five separate code points:
[MAN][ZWJ][WOMAN][ZWJ][BOY], which mb_strlen() will
tell you is five "characters" long, despite being visible as a single
grapheme.  Okay, so we look at the ICU grapheme functions, but depending on
what version of the Unicode database is installed, that answer may be five
or one.
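
A rough sketch of the point, in Python rather than PHP purely for
illustration (the lengths() helper is a made-up name, not a library
function). It counts the same visible glyph three ways: code points (what
mb_strlen() reports for UTF-8 input), UTF-8 bytes (what strlen() reports),
and UTF-16 code units.  It deliberately leaves out grapheme counting, since
that answer depends on the Unicode data in use.

```python
# Illustration only: three different "lengths" for the same visible glyph.
ZWJ = "\u200D"  # ZERO WIDTH JOINER

family_single = "\U0001F46A"  # FAMILY as one code point
# MAN, ZWJ, WOMAN, ZWJ, BOY: five code points
family_zwj = "\U0001F468" + ZWJ + "\U0001F469" + ZWJ + "\U0001F466"


def lengths(s):
    """Return (code points, UTF-8 bytes, UTF-16 code units) for s."""
    code_points = len(s)
    utf8_bytes = len(s.encode("utf-8"))
    # utf-16-le emits no byte order mark, so every code unit is 2 bytes
    utf16_units = len(s.encode("utf-16-le")) // 2
    return (code_points, utf8_bytes, utf16_units)


print(lengths(family_single))  # (1, 4, 2)
print(lengths(family_zwj))     # (5, 18, 8)
```

None of those numbers is "the" length; each is the right answer to a
different question.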

In short: Language is complicated and there's not a one-size-fits-all
solution.

-Sara
