On 21/03/2021 16:51, Larry Garfield wrote:
> As Rowan notes, what people actually *want* most of the time is "I got this string
> from a user and have NFI what it's encoding is, but my system needs UTF-8, so gimmie this
> string in UTF-8." And they use utf8_encode(), which then fails *sometimes* in
> exciting and mysterious ways, because that's not what it is.
> [...]
> If we're removing a bad answer to the problem, we should also replace it with a
> good answer.

This is indeed my main concern with complete deprecation. The problem is
that detecting string encoding is a Really Hard Problem™.

The fundamental problem is that any sequence of bytes is valid in any
single-byte encoding. If you're expecting printable characters only, you
can rule out some candidates if you're lucky - e.g. if your string
contains a byte in the range 0x80 to 0x9F, it's not any part of ISO 8859
- but the string "\xB0\xC0\xD0" is both valid and printable in any of
dozens of 8-bit encodings.
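
To make that concrete, here is a minimal sketch. It assumes the mbstring
extension is loaded; the list of candidate encodings is an arbitrary pick
for illustration, and has_c1_bytes() is a hypothetical helper name, not an
existing function. It decodes the same three bytes under several
single-byte encodings, none of which complains, and adds the one cheap
check that can rule out ISO 8859:

```php
<?php
// The three bytes from the example above.
$bytes = "\xB0\xC0\xD0";

// Assumption: an arbitrary handful of single-byte encodings known to mbstring.
$candidates = ['ISO-8859-1', 'ISO-8859-5', 'ISO-8859-7', 'Windows-1251', 'KOI8-R'];

$decoded = [];
foreach ($candidates as $encoding) {
    // Every candidate accepts the input; each yields different text.
    $decoded[$encoding] = mb_convert_encoding($bytes, 'UTF-8', $encoding);
    printf("%-12s => %s\n", $encoding, $decoded[$encoding]);
}

// Hypothetical helper: bytes 0x80-0x9F are control codes in every part of
// ISO 8859, so their presence in printable text rules that family out.
function has_c1_bytes(string $bytes): bool
{
    // No /u modifier: the pattern is matched against raw bytes.
    return preg_match('/[\x80-\x9F]/', $bytes) === 1;
}

var_dump(has_c1_bytes($bytes));     // no byte in 0x80-0x9F here
var_dump(has_c1_bytes("\x80500"));  // 0x80 is in the C1 range
```

Each line of output is different text, and nothing in the bytes
themselves says which one was intended.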

I recently came across a Python library implementing a clever approach
to the problem, which originated at Mozilla. Its concise FAQ is worth
reading: https://chardet.readthedocs.io/en/latest/faq.html

The approach Mozilla came up with is to decide which encoding leads to
something most likely to be natural human text - e.g. don't suggest an
encoding common for Cyrillic if the result would be completely
unpronounceable in Russian.

The only function I know of which even attempts encoding detection in
PHP is mb_detect_encoding, and it does a pretty bad job. For instance:

    echo mb_detect_encoding("\x80500", ['Windows-1252', 'ISO-8859-15', 'ISO-8859-1']);

...picks ISO-8859-15, where 0x80 is a rarely-used control character,
rather than Windows-1252, where it's the Euro symbol.
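
One crude way to avoid that particular mistake is to try each candidate in
turn and reject any whose decoded result contains control characters. This
is a sketch under my own assumptions, not what mbstring or chardet actually
does. first_encoding_without_controls() is a hypothetical name, and the
check only helps when you expect printable text:

```php
<?php
// Hypothetical helper: return the first candidate encoding under which
// $bytes decodes to text with no control characters, or null if none does.
function first_encoding_without_controls(string $bytes, array $candidates): ?string
{
    foreach ($candidates as $encoding) {
        $utf8 = mb_convert_encoding($bytes, 'UTF-8', $encoding);

        // Tab, CR and LF are control characters too, but are normal in text.
        $visible = str_replace(["\t", "\r", "\n"], '', $utf8);

        // \p{Cc} covers both the C0 (U+0000-U+001F) and C1 (U+0080-U+009F)
        // control ranges. preg_match() returns 0 when nothing matched.
        if (preg_match('/\p{Cc}/u', $visible) === 0) {
            return $encoding;
        }
    }
    return null;
}

echo first_encoding_without_controls(
    "\x80500",
    ['ISO-8859-15', 'ISO-8859-1', 'Windows-1252']
), "\n";
```

This is nowhere near real detection: it cannot choose between two
candidates that both give printable text, which is the common case.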

On the other hand, if you know what encoding you do have, either of the
following will work fine:

    echo mb_convert_encoding("\x80500", 'UTF-8', 'Windows-1252');
    echo iconv('Windows-1252', 'UTF-8', "\x80500");

Either of these functions (passed ISO-8859-1) can be used as a polyfill
for correct uses of utf8_encode/utf8_decode, but neither is going to do
the magic trick which people always *hope* those functions will.
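
As a rough sketch of such a polyfill, using mb_convert_encoding: the
function names latin1_to_utf8() and utf8_to_latin1() are made up for
illustration, and I haven't checked that every edge case (invalid UTF-8
input, for example) matches the originals byte for byte.

```php
<?php
// Hypothetical stand-in for utf8_encode(): ISO-8859-1 bytes in, UTF-8 out.
function latin1_to_utf8(string $latin1): string
{
    return mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
}

// Hypothetical stand-in for utf8_decode(): UTF-8 in, ISO-8859-1 bytes out.
function utf8_to_latin1(string $utf8): string
{
    return mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
}

// 0xE9 is "e with acute accent" in ISO-8859-1; in UTF-8 it is 0xC3 0xA9.
echo bin2hex(latin1_to_utf8("\xE9")), "\n";
echo bin2hex(utf8_to_latin1("\xC3\xA9")), "\n";
```

Note that latin1_to_utf8("\x80500") gives the control character U+0080
followed by "500", not a Euro symbol: the conversion is only right when
the input really is ISO-8859-1.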
Regards,
--
Rowan Tommins
[IMSoP]
--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php