On 8/12/2024 9:53 AM, Rowan Tommins [IMSoP] wrote:


On 11 August 2024 16:50:52 BST, Nick Lockheart <li...@ageofdream.com> wrote:
It seems that if everything on the Internet is multi-byte encoded now,
then all of the PHP string functions should be multi-byte safe.

The phrase "multibyte safe" may have made sense about 30 years ago, when it was thought that a 
"universal character set" could just be a "wide ASCII", encoding a straightforward list 
of characters, just more of them.

Modern Unicode is so much more than that, because the world's writing systems don't all work the same way. 
Should strlen() measure bytes, code points, or graphemes? Should strtoupper() accept a locale, so it can 
handle cases like Turkish "dotless i" where "I" is not the uppercase of "i"? 
And so on, and so on.
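
To make the strlen() and strtoupper() questions concrete, here is a minimal sketch in Python (chosen only because it runs with the standard library; in PHP the rough analogues are strlen() for bytes, mb_strlen() for code points, and grapheme_strlen() from the intl extension for graphemes). The helper name `measures` is made up for illustration, and the "approx_graphemes" count is a deliberate simplification that only skips combining marks; it is not the full Unicode segmentation algorithm.

```python
import unicodedata


def measures(text: str) -> dict:
    """Return three different answers to "how long is this string?"."""
    return {
        # Number of bytes in the UTF-8 encoding.
        "bytes": len(text.encode("utf-8")),
        # Number of Unicode code points.
        "code_points": len(text),
        # Rough stand-in for graphemes: count code points that are not
        # combining marks. Real grapheme segmentation has many more rules.
        "approx_graphemes": sum(
            1 for ch in text if unicodedata.combining(ch) == 0
        ),
    }


# "e" followed by U+0301 COMBINING ACUTE ACCENT: one visible character.
decomposed = "e\u0301"
print(measures(decomposed))
# {'bytes': 3, 'code_points': 2, 'approx_graphemes': 1}

# Locale-independent case mapping gives "I" for "i". Turkish expects
# U+0130 (capital I with dot above) instead, so a locale is needed.
turkish_upper_i = "\u0130"
print("i".upper() == turkish_upper_i)  # False
```

All three numbers are "correct"; which one you want depends on whether you are allocating storage, indexing, or showing a character count to a user.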

I've seen plenty of languages boast that they are "Unicode aware", but few actually engage with the question 
of what that means. Often they equate "character" with "code point" and stop there, which 
leads to results that are just as useless to most of the world as if they'd equated it with "byte".
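
As a small illustration of why "character = code point" falls short, the following Python sketch reverses a string two ways. Both function names are hypothetical, and the cluster grouping is an assumption-laden simplification: it only attaches combining marks to the preceding code point and ignores emoji sequences, Hangul, and the rest of the real grapheme rules.

```python
import unicodedata


def reverse_by_code_point(text: str) -> str:
    """Reverse treating every code point as a 'character'."""
    return text[::-1]


def reverse_by_cluster(text: str) -> str:
    """Reverse while keeping combining marks attached to their base.

    Simplified: a cluster is a base code point plus any following
    code points with a non-zero combining class.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch  # attach the mark to the previous base
        else:
            clusters.append(ch)
    return "".join(reversed(clusters))


# "noël" written with a decomposed ë: "e" + U+0308 COMBINING DIAERESIS.
word = "noe\u0308l"

# The diaeresis ends up on the "l" instead of the "e".
print(reverse_by_code_point(word) == "l\u0308eon")  # True

# The diaeresis stays with the "e".
print(reverse_by_cluster(word) == "le\u0308on")  # True
```

The code-point version is not corrupting bytes, so it looks "multibyte safe", yet the result is still wrong for the person reading it.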

Regards,
Rowan Tommins
[IMSoP]

Feels appropriate to link to this:
"The Absolute Minimum Every Software Developer Must Know About Unicode in 2023 (Still No Excuses!)"
https://tonsky.me/blog/unicode/
