On 21/03/2021 16:51, Larry Garfield wrote:
As Rowan notes, what people actually *want* most of the time is "I got this string 
from a user and have NFI what it's encoding is, but my system needs UTF-8, so gimmie this 
string in UTF-8."  And they use utf8_encode(), which then fails *sometimes* in 
exciting and mysterious ways, because that's not what it is.

[...]

If we're removing a bad answer to the problem, we should also replace it with a 
good answer.


This is indeed my main concern with complete deprecation. The problem is that detecting string encoding is a Really Hard Problem™.

The fundamental problem is that any sequence of bytes is valid in practically any single-byte encoding. If you're expecting printable characters only, you can rule out some candidates if you're lucky - e.g. if your string contains a byte in the range 0x80 to 0x9F, it's not any part of ISO 8859, which reserves that range for control characters - but the string "\xB0\xC0\xD0" is both valid and printable in any of dozens of 8-bit encodings.
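To make that concrete, here is a rough sketch, assuming the mbstring extension is loaded. The helper name hasC1Bytes() is made up for illustration, and the three encodings are just a sample of the many that accept these bytes.

```php
<?php
// Hypothetical helper: true if the string contains a byte in 0x80-0x9F,
// which ISO 8859 reserves for control characters rather than printable text.
function hasC1Bytes(string $bytes): bool
{
    return preg_match('/[\x80-\x9F]/', $bytes) === 1;
}

$bytes = "\xB0\xC0\xD0";
var_dump(hasC1Bytes($bytes));     // false: this check rules nothing out
var_dump(hasC1Bytes("\x80500"));  // true: not printable ISO 8859

// The same three bytes decode without complaint under unrelated encodings.
$decoded = [];
foreach (['ISO-8859-1', 'ISO-8859-5', 'ISO-8859-7'] as $encoding) {
    $decoded[$encoding] = mb_convert_encoding($bytes, 'UTF-8', $encoding);
    echo $encoding, ': ', $decoded[$encoding], "\n";
}
```

Each line of output is a different, perfectly printable three-character string (Latin, Cyrillic and Greek respectively), and nothing in the bytes themselves says which one was meant.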

I recently came across a Python library implementing a clever approach to the problem, which originated at Mozilla. Its concise FAQ is worth reading: https://chardet.readthedocs.io/en/latest/faq.html The approach Mozilla came up with is to decide which encoding leads to something most likely to be natural human text - e.g. don't suggest an encoding common for Cyrillic if the result would be completely unpronounceable in Russian.
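As a toy illustration of the "which decoding looks most like human text" idea: the sketch below is far cruder than what that library does and is not its algorithm. The function guessEncoding() and the encoding-to-script table are assumptions of mine; it needs mbstring and PCRE with Unicode support.

```php
<?php
// Toy heuristic, NOT chardet's algorithm: score each candidate encoding by the
// share of decoded characters that belong to the script we expect from it.
// $candidates maps an encoding name to a PCRE Unicode script name.
function guessEncoding(string $bytes, array $candidates): ?string
{
    $best = null;
    $bestScore = 0.0;
    foreach ($candidates as $encoding => $script) {
        $text = mb_convert_encoding($bytes, 'UTF-8', $encoding);
        $length = mb_strlen($text, 'UTF-8');
        if ($length === 0) {
            continue;
        }
        $matches = preg_match_all('/\p{' . $script . '}/u', $text);
        $score = $matches / $length;
        if ($score > $bestScore) { // strict ">" keeps the earlier candidate on ties
            $best = $encoding;
            $bestScore = $score;
        }
    }
    return $best;
}

$candidates = ['ISO-8859-1' => 'Latin', 'ISO-8859-5' => 'Cyrillic'];
echo guessEncoding("\xBF\xE0\xD8\xD2\xD5\xE2", $candidates), "\n"; // "Привет" if read as ISO-8859-5
echo guessEncoding("caf\xE9", $candidates), "\n";                   // "café" if read as ISO-8859-1
```

Even this only works because the two candidates use different scripts. Telling Windows-1252 from ISO-8859-15 needs real language statistics, which is exactly the hard part.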


The only function I know of which even attempts encoding detection in PHP is mb_detect_encoding, and it does a pretty bad job. For instance:

echo mb_detect_encoding("\x80500", ['Windows-1252', 'ISO-8859-15', 'ISO-8859-1']);

...picks ISO-8859-15, where 0x80 is a rarely-used control character, rather than Windows-1252, where it's the Euro symbol.
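To see why the choice matters, this sketch (mbstring assumed) prints whatever the installed PHP detects, followed by the UTF-8 bytes each candidate would actually produce. I have not checked every PHP version, so the detection result on your machine may differ from the one above.

```php
<?php
$bytes = "\x80500";
$candidates = ['Windows-1252', 'ISO-8859-15', 'ISO-8859-1'];

// Whatever the installed PHP picks; this may differ between versions.
var_dump(mb_detect_encoding($bytes, $candidates));

// What each candidate would actually turn the string into, shown as hex.
$decoded = [];
foreach ($candidates as $encoding) {
    $decoded[$encoding] = mb_convert_encoding($bytes, 'UTF-8', $encoding);
    printf("%-12s %s\n", $encoding, bin2hex($decoded[$encoding]));
}
```

Under Windows-1252 the first byte becomes e282ac (U+20AC, the Euro sign); under ISO-8859-1 it becomes c280 (the control character U+0080), which is almost certainly not what the sender meant by "€500".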


On the other hand, if you know what encoding you do have, either of the following will work fine:

echo mb_convert_encoding("\x80500", 'UTF-8', 'Windows-1252');
echo iconv('Windows-1252', 'UTF-8', "\x80500");

Either of these functions (passed ISO-8859-1) can be used as a polyfill for correct uses of utf8_encode/utf8_decode, but neither is going to do the magic trick which people always *hope* those functions will.
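A minimal sketch of such a polyfill, using mb_convert_encoding (so mbstring is assumed). The function names are hypothetical; I chose them to say what the conversion actually is.

```php
<?php
// Hypothetical replacement names: the old functions only ever converted
// between ISO-8859-1 and UTF-8, so say so explicitly.
function latin1ToUtf8(string $latin1): string
{
    return mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
}

function utf8ToLatin1(string $utf8): string
{
    return mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
}

echo bin2hex(latin1ToUtf8("caf\xE9")), "\n";     // 636166c3a9
echo bin2hex(utf8ToLatin1("caf\xC3\xA9")), "\n"; // 636166e9
```

Behaviour for malformed input, or for characters with no ISO-8859-1 equivalent, may not match the old functions exactly, so check that before relying on it.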


Regards,

--
Rowan Tommins
[IMSoP]

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php
