Re: [PHP-DEV] What should we do with utf8_encode and utf8_decode?

Rowan Tommins Sun, 21 Mar 2021 11:13:26 -0700

On 21/03/2021 16:51, Larry Garfield wrote:

> As Rowan notes, what people actually *want* most of the time is "I got this string
> from a user and have NFI what its encoding is, but my system needs UTF-8, so gimmie this
> string in UTF-8."  And they use utf8_encode(), which then fails *sometimes* in
> exciting and mysterious ways, because that's not what it is.
>
> [...]
>
> If we're removing a bad answer to the problem, we should also replace it with a
> good answer.

This is indeed my main concern with complete deprecation. The problem is that detecting string encoding is a Really Hard Problem™.

The fundamental problem is that any sequence of bytes is valid in any single-byte encoding. If you're expecting printable characters only, you can rule out some candidates if you're lucky - e.g. if your string contains a byte in the range 0x80 to 0x9F, it's not any part of ISO 8859 - but the string "\xB0\xC0\xD0" is both valid and printable in any of dozens of 8-bit encodings.
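To make the ambiguity concrete, here is a minimal sketch that decodes those same three bytes under three different encodings. The candidate list is an arbitrary pick for illustration, and the snippet assumes the mbstring extension is loaded.

```php
<?php
// The three bytes from the example above.
$bytes = "\xB0\xC0\xD0";

// Candidate encodings chosen for illustration only; many others would do.
$candidates = ['ISO-8859-1', 'ISO-8859-5', 'Windows-1251'];

$decoded = [];
foreach ($candidates as $encoding) {
    // Interpret the same bytes as $encoding and re-encode them as UTF-8.
    $decoded[$encoding] = mb_convert_encoding($bytes, 'UTF-8', $encoding);
    echo $encoding, ': ', $decoded[$encoding], "\n";
}
```

Each candidate yields different but perfectly printable text, and nothing in the bytes themselves says which one was intended.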

I recently came across a Python library implementing a clever approach to the problem, which originated at Mozilla. Its concise FAQ is worth reading: https://chardet.readthedocs.io/en/latest/faq.html The approach Mozilla came up with is to decide which encoding leads to something most likely to be natural human text - e.g. don't suggest an encoding common for Cyrillic if the result would be completely unpronounceable in Russian.

The only function I know of which even attempts encoding detection in PHP is mb_detect_encoding, and it does a pretty bad job. For instance:

echo mb_detect_encoding("\x80500", ['Windows-1252', 'ISO-8859-15', 'ISO-8859-1']);

...picks ISO-8859-15, where 0x80 is a rarely-used control character, rather than Windows-1252, where it's the Euro symbol.
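The "rule out candidates if you expect printable characters" idea mentioned earlier can be sketched as follows. This is only an illustration, not how mb_detect_encoding works and not a real detector: the function name plausibleEncodings is hypothetical, the rule "no control characters other than tab, LF and CR" is an assumption about what counts as printable, and mbstring is assumed to be loaded.

```php
<?php
// Hypothetical helper: keep only the candidates under which $bytes
// decodes to text containing no unexpected control characters.
function plausibleEncodings(string $bytes, array $candidates): array
{
    $plausible = [];
    foreach ($candidates as $encoding) {
        // Skip candidates in which some byte has no assigned meaning.
        if (!mb_check_encoding($bytes, $encoding)) {
            continue;
        }
        $utf8 = mb_convert_encoding($bytes, 'UTF-8', $encoding);
        // Match any control character (Unicode category Cc) except
        // tab, line feed and carriage return.
        if (preg_match('/[^\P{Cc}\t\n\r]/u', $utf8) === 0) {
            $plausible[] = $encoding;
        }
    }
    return $plausible;
}

$candidates = ['Windows-1252', 'ISO-8859-15', 'ISO-8859-1'];
print_r(plausibleEncodings("\x80500", $candidates));
print_r(plausibleEncodings("\xB0\xC0\xD0", $candidates));
```

For "\x80500" this leaves only Windows-1252, because the two ISO 8859 variants decode 0x80 to a control character. For "\xB0\xC0\xD0" all three candidates survive, which is exactly the case where a filter like this cannot help.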

On the other hand, if you know what encoding you do have, either of the following will work fine:


echo mb_convert_encoding("\x80500", 'UTF-8', 'Windows-1252');
echo iconv('Windows-1252', 'UTF-8', "\x80500");

Either of these functions (passed ISO-8859-1) can be used as a polyfill for correct uses of utf8_encode/utf8_decode, but neither is going to do the magic trick which people always *hope* those functions will.
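As a rough sketch of such a polyfill, using mb_convert_encoding: the helper names latin1_to_utf8 and utf8_to_latin1 are hypothetical (chosen to say what the conversion actually does), and handling of characters that have no ISO-8859-1 equivalent is left to mbstring's substitution setting, which may not match utf8_decode in every edge case.

```php
<?php
// Hypothetical helper names; they are not part of PHP itself.

// Equivalent in intent to utf8_encode(): ISO-8859-1 bytes in, UTF-8 out.
function latin1_to_utf8(string $latin1): string
{
    return mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
}

// Equivalent in intent to utf8_decode(): UTF-8 in, ISO-8859-1 bytes out.
// Characters outside ISO-8859-1 are handled by mbstring's substitute
// character setting (assumed to be left at its default).
function utf8_to_latin1(string $utf8): string
{
    return mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
}

// "é" is the single byte 0xE9 in ISO-8859-1 and 0xC3 0xA9 in UTF-8.
echo bin2hex(latin1_to_utf8("\xE9")), "\n";     // c3a9
echo bin2hex(utf8_to_latin1("\xC3\xA9")), "\n"; // e9
```

The explicit names make the source and target encodings visible at the call site, which is the information utf8_encode and utf8_decode hide.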



Regards,

--
Rowan Tommins
[IMSoP]

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php
