On 11/02/2022 18:42, Michał wrote:
> Considering the given example, the description from the documentation
> of strlen function: "Returns the length of the given string".
Which is exactly what it does. Using Unicode terminology [see
https://unicode.org/glossary], here are a few different things you could
count to determine the "length" of a string:
a) bits
b) bytes
c) code units (UTF-16 has code units of 16 bits, UTF-8 has code units of
8 bits)
d) code points (one of 1,112,064 numbers that can be given a meaning by
the Unicode standard)
e) graphemes (what a user would generally think of as a "character")
f) pixels (or any other unit of physical size)
mb_strlen() will measure (d), which is frankly pretty useless - do you
really need to know that "noél" is 5 code points long when the "é" is
written as "e" followed by U+0301 COMBINING ACUTE ACCENT, but only 4
when it is the pre-composed accented letter U+00E9?
Much more often you want strlen() to tell you (b) - in UTF-8, one will
take up 6 bytes of storage and the other only 5; or grapheme_strlen() to
tell you (e) - both have 4 graphemes.
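As a rough sketch of the difference, the script below counts the same two
spellings in several of the units listed above. It assumes UTF-8 strings, and
that the mbstring and intl extensions may or may not be loaded, so those calls
are guarded; the variable names are mine, purely for illustration.

```php
<?php
// \u{301} is U+0301 COMBINING ACUTE ACCENT; \u{e9} is the pre-composed "é".
$decomposed  = "noe\u{301}l";
$precomposed = "no\u{e9}l";

foreach (['decomposed' => $decomposed, 'precomposed' => $precomposed] as $label => $s) {
    // (b) bytes: strlen() always counts bytes
    printf("%-11s bytes=%d", $label, strlen($s));

    if (function_exists('mb_strlen')) {
        // (c) UTF-16 code units: convert, then halve the byte count
        $utf16 = mb_convert_encoding($s, 'UTF-16LE', 'UTF-8');
        printf(" utf16_units=%d", intdiv(strlen($utf16), 2));
        // (d) code points
        printf(" code_points=%d", mb_strlen($s, 'UTF-8'));
    }
    if (function_exists('grapheme_strlen')) {
        // (e) graphemes, via the intl extension
        printf(" graphemes=%d", grapheme_strlen($s));
    }
    echo "\n";
}
```

Only the grapheme count agrees between the two spellings; every other unit
differs by one.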
The same goes for the "mb_strcut" function mentioned by Mel Dafert; try
running this:
echo mb_strcut("noe\u{301}l", 3, 3, 'UTF-8'); // "e" + U+0301 combining acute
https://3v4l.org/s2SsR
The algorithm "correctly" keeps all the bytes of the acute accent, but
drops the "e" it was on top of; probably not a very useful result.
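For comparison, here is a small sketch that puts the byte-based cut next to a
grapheme-based one. It assumes the decomposed spelling and that mbstring and
intl are available (both calls are guarded); the hex dumps are only there so
the result is visible regardless of how your terminal renders a stray
combining accent.

```php
<?php
// Decomposed spelling: bytes are n, o, e, 0xCC 0x81 (U+0301), l
$s = "noe\u{301}l";

if (function_exists('mb_strcut')) {
    // Byte offsets 3..5: both bytes of the accent plus "l";
    // the "e" at byte 2 is left behind.
    $cut = mb_strcut($s, 3, 3, 'UTF-8');
    echo bin2hex($cut), "\n"; // cc816c
}

if (function_exists('grapheme_substr')) {
    // Grapheme offsets 2..3: "e" and its accent stay together, then "l".
    $sub = grapheme_substr($s, 2, 2);
    echo bin2hex($sub), "\n"; // 65cc816c
}
```

Whether the grapheme-based result is what you want still depends on what the
offsets in your application are supposed to mean.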
And that's before we get to functions which should behave differently in
different languages, like correctly capitalising "i" in Turkish:
https://en.wikipedia.org/wiki/Dotted_and_dotless_I
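To make the Turkish case concrete, a hedged sketch: mb_strtoupper() takes no
locale argument, so it always maps "i" to a plain "I". One way to get
language-specific casing is the intl extension's Transliterator with ICU's
'tr-Upper' transform; I am assuming that ID is available in your ICU build,
so the code checks for null before using it. The sample word is just an
illustration.

```php
<?php
$word = 'istanbul';

if (function_exists('mb_strtoupper')) {
    // Language-agnostic mapping: "i" becomes plain "I" (U+0049)
    echo mb_strtoupper($word, 'UTF-8'), "\n"; // ISTANBUL
}

if (class_exists('Transliterator')) {
    // Turkish casing rules map "i" to the dotted capital "İ" (U+0130)
    $tr = Transliterator::create('tr-Upper');
    if ($tr !== null) {
        echo $tr->transliterate($word), "\n";
    }
}
```

Note that the caller has to know the text is Turkish in the first place; the
string itself does not carry that information.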
Doing this stuff right is really, really difficult; and that is the
reason it doesn't just "work out of the box".
Regards,
--
Rowan Tommins
[IMSoP]
--
PHP Internals - PHP Runtime Development Mailing List