On 11/02/2022 18:42, Michał wrote:
> Considering the given example, the description from the documentation of strlen function: "Returns the length of the given string".


Which is exactly what it does. Using Unicode terminology [see https://unicode.org/glossary], here are a few different things you could count to determine the "length" of a string:

a) bits
b) bytes
c) code units (UTF-16 has code units of 16 bits, UTF-8 has code units of 8 bits)
d) code points (one of 1,112,064 numbers that can be given a meaning by the Unicode standard)
e) graphemes (what a user would generally think of as a "character")
f) pixels (or any other unit of physical size)

mb_strlen() will measure (d), which is frankly pretty useless - do you really need to know that "noél" is 5 code points long, but "noél" is only 4? (The first uses a combining diacritic, the other a pre-composed accented letter.)

Much more often you want strlen() to tell you (b) - one will take up 6 bytes of storage and the other only 5; or grapheme_strlen() to tell you (e) - both have 4 graphemes.
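
To make the three counts concrete, here is a minimal sketch. It assumes the mbstring and intl extensions are loaded, and it spells the two strings with explicit \u{...} escapes so the combining and pre-composed forms cannot be confused; the variable names and the countUnits() helper are purely illustrative.

```php
<?php
// Decomposed form: "e" followed by U+0301 COMBINING ACUTE ACCENT.
$decomposed = "noe\u{0301}l";
// Pre-composed form: U+00E9 LATIN SMALL LETTER E WITH ACUTE.
$precomposed = "no\u{00E9}l";

// Illustrative helper: count the same string in three different units.
function countUnits(string $s): array
{
    return [
        'bytes'       => strlen($s),             // (b) bytes
        'code_points' => mb_strlen($s, 'UTF-8'), // (d) code points
        'graphemes'   => grapheme_strlen($s),    // (e) graphemes, needs ext/intl
    ];
}

$a = countUnits($decomposed);
$b = countUnits($precomposed);

foreach (['decomposed' => $a, 'precomposed' => $b] as $label => $counts) {
    printf(
        "%-11s bytes=%d code_points=%d graphemes=%d\n",
        $label,
        $counts['bytes'],
        $counts['code_points'],
        $counts['graphemes']
    );
}
// decomposed  bytes=6 code_points=5 graphemes=4
// precomposed bytes=5 code_points=4 graphemes=4
```

Only the grapheme count agrees between the two spellings, which is why it is usually the one that matches what a user sees.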


The same goes for the "mb_strcut" function mentioned by Mel Dafert; try running this:

echo mb_strcut('noél', 3, 3, 'UTF-8');

https://3v4l.org/s2SsR

The algorithm "correctly" keeps all the bytes of the acute accent, but drops the "e" it was on top of; probably not a very useful result.
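
A small sketch of the same cut, written with an explicit escape so the input is unambiguously the combining form; it assumes mbstring and intl are available, and the variable names are illustrative. The grapheme_substr() call is shown only as one possible contrast, not as a drop-in replacement, since it takes offsets in graphemes rather than bytes.

```php
<?php
// "noél" spelled as n, o, e, U+0301 (combining acute), l: 6 bytes in UTF-8.
$s = "noe\u{0301}l";

// Byte offset 3 is the first byte of U+0301, so the cut starts on a
// code point boundary and no code point is split.
$cut = mb_strcut($s, 3, 3, 'UTF-8');
echo bin2hex($cut), "\n"; // cc816c: the accent's two bytes, then "l"

// Cutting by grapheme keeps the base letter and its accent together.
// Offset 2 and length 2 here are counted in graphemes, not bytes.
$byGrapheme = grapheme_substr($s, 2, 2);
echo bin2hex($byGrapheme), "\n"; // 65cc816c: "e", the accent, then "l"
```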


And that's before we get to functions which should behave differently in different languages, like correctly capitalising "i" in Turkish: https://en.wikipedia.org/wiki/Dotted_and_dotless_I
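
As a minimal illustration, assuming mbstring is loaded: mb_strtoupper() and mb_strtolower() accept an encoding but no language, so they cannot produce the Turkish mappings. The variable names are illustrative.

```php
<?php
// Case conversion with no notion of language.
$upper = mb_strtoupper('i', 'UTF-8'); // "I" (U+0049)
$lower = mb_strtolower('I', 'UTF-8'); // "i" (U+0069)

// What Turkish orthography expects instead.
$turkishUpper = "\u{0130}"; // İ, capital dotted I
$turkishLower = "\u{0131}"; // ı, small dotless i

var_dump($upper === $turkishUpper); // bool(false)
var_dump($lower === $turkishLower); // bool(false)
```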

Doing this stuff right is really, really difficult; and that is the reason it doesn't just "work out of the box".


Regards,

--
Rowan Tommins
[IMSoP]

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php
