On 20/06/2019 23:30, Mark Randall wrote:
> There does at least seem to be the starting point in that mbstring is already widely used, and my suggestion that it "work as expected" is more that it would work as the equivalent mbstring / iconv function would.
I think this is a rather short-sighted way of looking at it. If people want the API provided by the mbstring extension, they can just use those functions; the advantage of designing a new set of functions is surely that we don't need to stick to past decisions. If we start to build a new standard library, as Zeev suggested in the deprecation thread, it is a once-in-a-lifetime chance to build something better, not just copy what's gone before.
> mb_strlen returns the number of code points, for example. I'm not immediately seeing anything about mbstring supporting graphemes; the only reference I could find to their manipulation was the intl extension.
The mbstring extension was not built for Unicode, but for older Japanese multi-byte encodings, where the definition of "character" is much more straightforward. Its Unicode support seems to treat code points mostly as mappings for characters in some other encoding. (The oldest manual page for it on archive.org [1] is from 2001, and includes the quaint remark "As Unicode is getting popular, UTF-8 is used also.") The iconv library is even more explicitly aimed at converting between character sets, rather than understanding them (the extra functions such as iconv_strlen are unique to PHP).
Unicode today is much more than a mapping of legacy encodings to a universal character set, and I can think of no useful purpose in declaring the "string length" of the British flag emoji to be 2, just because it is encoded as the sequence U+1F1EC U+1F1E7.
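To make the difference concrete, here is a minimal sketch comparing the three notions of "length" for that flag. It assumes PHP 7 or later with both the mbstring and intl extensions loaded; string_lengths() is a hypothetical helper written for this illustration, not an existing or proposed function.

```php
<?php
// Sketch only: assumes the mbstring and intl extensions are loaded.
// string_lengths() is a hypothetical helper, not part of PHP.
function string_lengths(string $s): array
{
    return [
        'bytes'       => strlen($s),             // raw bytes of the UTF-8 encoding
        'code_points' => mb_strlen($s, 'UTF-8'), // Unicode code points (mbstring)
        'graphemes'   => grapheme_strlen($s),    // user-perceived characters (intl)
    ];
}

// British flag: REGIONAL INDICATOR SYMBOL LETTER G followed by LETTER B
$flag = "\u{1F1EC}\u{1F1E7}";

print_r(string_lengths($flag));
// bytes => 8, code_points => 2, graphemes => 1
```

Only the grapheme count matches what a user would call "one character"; the other two numbers are properties of the encoding.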
[1] http://web.archive.org/web/20010605075550/http://www.php.net/manual/en/ref.mbstring.php
Regards,
--
Rowan Collins
[IMSoP]