Re: [PHP-DEV] Multibyte strings

Rowan Tommins Fri, 11 Feb 2022 12:47:52 -0800

On 11/02/2022 18:42, Michał wrote:

Considering the given example, the description from the documentation of strlen function: "Returns the length of the given string".

Which is exactly what it does. Using Unicode terminology [see https://unicode.org/glossary], here are a few different things you could count to determine the "length" of a string:


a) bits
b) bytes

c) code units (UTF-16 has code units of 16 bits, UTF-8 has code units of 8 bits)
d) code points (one of 1,112,064 numbers that can be given a meaning by the Unicode standard)

e) graphemes (what a user would generally think of as a "character")
f) pixels (or any other unit of physical size)

mb_strlen() will measure (d), which is frankly pretty useless - do you really need to know that "noél" is 5 code points long, but "noél" is only 4? (The first uses a combining diacritic, the other a pre-composed accented letter.)

Much more often you want strlen() to tell you (b) - one will take up 6 bytes of storage and the other only 5; or grapheme_strlen() to tell you (e) - both have 4 graphemes.
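
To make the three counts concrete, here is a minimal sketch comparing both spellings. It assumes the mbstring and intl extensions are loaded, and it writes the strings with explicit \u{...} escapes so the combining and pre-composed forms cannot be confused; the variable names are only for illustration.

```php
<?php
// Two visually identical spellings of the same word.
$decomposed  = "noe\u{0301}l"; // 'e' followed by U+0301 COMBINING ACUTE ACCENT
$precomposed = "no\u{00E9}l";  // U+00E9 LATIN SMALL LETTER E WITH ACUTE

foreach (['decomposed' => $decomposed, 'precomposed' => $precomposed] as $label => $s) {
    printf(
        "%-11s bytes=%d code points=%d graphemes=%d\n",
        $label,
        strlen($s),             // (b) bytes
        mb_strlen($s, 'UTF-8'), // (d) code points
        grapheme_strlen($s)     // (e) graphemes, needs ext/intl
    );
}
// decomposed  bytes=6 code points=5 graphemes=4
// precomposed bytes=5 code points=4 graphemes=4
```

Only the grapheme count agrees for the two spellings, which is usually the number a user would expect.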

The same goes for the "mb_strcut" function mentioned by Mel Dafert; try running this:


echo mb_strcut('noél', 3, 3, 'UTF-8');

https://3v4l.org/s2SsR

The algorithm "correctly" keeps all the bytes of the acute accent, but drops the "e" it was on top of; probably not a very useful result.
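
For comparison, a rough sketch of the same cut done by bytes and by graphemes. It assumes the string uses the combining form (written here with an explicit escape) and that ext/intl is available for grapheme_substr(); the output is shown as hex so the stray accent is visible.

```php
<?php
// 'e' + U+0301 COMBINING ACUTE ACCENT, so the accent is a separate code point.
$word = "noe\u{0301}l";

// Byte-oriented cut: starts at byte 3, which is the first byte of the accent.
$byBytes = mb_strcut($word, 3, 3, 'UTF-8');
echo bin2hex($byBytes), "\n";     // cc816c  (accent + "l", the "e" is gone)

// Grapheme-oriented cut: starts at the third grapheme, takes two of them.
$byGraphemes = grapheme_substr($word, 2, 2);
echo bin2hex($byGraphemes), "\n"; // 65cc816c  ("e" + accent + "l")
```

The offsets differ between the two calls because one counts bytes and the other counts graphemes; that is rather the point.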

And that's before we get to functions which should behave differently in different languages, like correctly capitalising "i" in Turkish: https://en.wikipedia.org/wiki/Dotted_and_dotless_I
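
As an illustration of the Turkish case only: mb_strtoupper() has no locale parameter, so it always maps "i" to "I". The turkishUpper() helper below is hypothetical, not part of PHP or mbstring, and is a sketch of the two special mappings rather than a full locale-aware implementation.

```php
<?php
// Hypothetical helper (not a PHP built-in): handles only the Turkish i/ı rule.
function turkishUpper(string $s): string
{
    // Map dotted "i" to U+0130 (İ) and dotless "ı" (U+0131) to "I" first,
    // then let mb_strtoupper() handle every other letter.
    $s = strtr($s, ['i' => "\u{0130}", "\u{0131}" => 'I']);
    return mb_strtoupper($s, 'UTF-8');
}

echo mb_strtoupper('istanbul', 'UTF-8'), "\n"; // ISTANBUL
echo turkishUpper('istanbul'), "\n";           // İSTANBUL
```

Whether the second result is the right one depends entirely on the language of the text, which the string itself cannot tell you.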

Doing this stuff right is really, really difficult; and that is the reason it doesn't just "work out of the box".



Regards,

--
Rowan Tommins
[IMSoP]

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php
