On Sun, Mar 21, 2021, at 9:18 AM, Rowan Tommins wrote:
> Hi all,
> 
> The functions utf8_encode and utf8_decode are historical oddities, which 
> almost certainly would not be accepted if proposed today:
> 
> * Their names do not describe their functionality, which is to convert 
> to/from one specific single-byte encoding. This leads to a common 
> confusion that they can be used to "fix" UTF-8 encoding problems, which 
> they generally make worse.
> * That single-byte encoding is ISO 8859-1, not its common cousins 
> Windows-1252 or ISO 8859-15. This means, for instance, that they do not 
> handle the Euro sign: utf8_decode('€') returns '?' (i.e. unmappable)  
> not "\x80" (Windows-1252) or "\xA4" (8859-15)
> 
> On the other hand, they are commonly used, both correctly and 
> incorrectly, so removing them is not easy.
> 
> A previous proposal to remove them [1] resulted in Andrea making two 
> significant improvements: moving them from ext/xml to ext/standard [2] 
> and rewriting the documentation to explain them properly [3]. My genuine 
> thanks for that.
> 
> However, it hasn't stopped people misunderstanding them, and quite 
> reasonably: you shouldn't need to look up every function you use in the 
> manual, to make sure it actually does what its name suggests.
> 
> 
> I can see three ways forward:
> 
> A) Raise a deprecation notice in 8.1, and remove in 9.0. Do not provide 
> a specific replacement, but recommend people look at iconv() or 
> mb_convert_encoding(). There is precedent for this, such as 
> convert_cyr_string(), but it may frustrate those who are using the 
> functions correctly.
> 
> B) Introduce new names, such as utf8_to_iso_8859_1 and 
> iso_8859_1_to_utf8; immediately make those the primary names in the 
> manual, with utf8_encode / utf8_decode as aliases. Raise deprecation 
> notices for the old names, either immediately or in some future release. 
> This gives a smoother upgrade path, but commits us to having these 
> functions as outliers in our standard library.
> 
> C) Leave them alone forever. Treat it as the user's fault if they mess 
> things up by misunderstanding them.
> 
> 
> I am happy to put together an RFC for either A or B, if it has a chance 
> of reaching consensus. I would really like to avoid option C.
> 
> 
> [1] https://externals.io/message/95166
> [2] https://github.com/php/php-src/pull/2160
> [3] 
> https://github.com/php/doc-en/commit/838941f6cce51f3beda16012eb497b26295a8238
> 
> Regards,

I lost several days of my life to exactly this problem, many years ago.  I am 
still triggered by it.

I am mostly OK with option A, but with a big caveat:

The root problem here is "You keep using that function.  I do not think it 
means what you think it means."

As Rowan notes, what people actually *want* most of the time is "I got this 
string from a user and have NFI what its encoding is, but my system needs 
UTF-8, so gimme this string in UTF-8."  And they use utf8_encode(), which then 
fails *sometimes* in exciting and mysterious ways, because that's not what it 
does.
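
To make the "fails sometimes" part concrete, here is a minimal sketch of the 
failure mode.  It assumes mbstring is loaded and uses a hypothetical helper, 
latin1_to_utf8(), as a stand-in for utf8_encode(); both treat every input byte 
as one ISO 8859-1 character.  Input that really is ISO 8859-1 comes out fine, 
while input that is already UTF-8 gets encoded a second time.

```php
<?php
// Hypothetical stand-in for utf8_encode(): every input byte is read as one
// ISO 8859-1 character and re-encoded as UTF-8.
function latin1_to_utf8(string $bytes): string
{
    return mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');
}

$latin1 = "caf\xE9";      // "café" in ISO 8859-1: the intended use
$utf8   = "caf\xC3\xA9";  // "café" already in UTF-8: the common misuse

$ok     = latin1_to_utf8($latin1); // correct UTF-8
$broken = latin1_to_utf8($utf8);   // double-encoded, displays as "cafÃ©"

echo bin2hex($ok), "\n";     // 636166c3a9
echo bin2hex($broken), "\n"; // 636166c383c2a9
```

Pure-ASCII input passes through unchanged either way, which is why the misuse 
can go unnoticed until the first accented character shows up.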

Removing utf8_encode() may keep people from misusing it, but that doesn't mean 
the problem space they were trying to solve goes away.  If anything, people who 
still don't realize that it's the wrong solution will get angry that we're 
taking away a "useful" tool and replacing it with "meh, go look at library X," 
which is admittedly a pretty rude answer.

If we're removing a bad answer to the problem, we should also replace it with a 
good answer.

Someone will, I'm sure, pop in at this point and declare "if you don't know the 
character encoding you're receiving, then you're doing it wrong and are already 
lost and we can't help you."  While that may be technically correct, it's also 
an entirely useless answer because strings received over HTTP very frequently 
do not tell you what their encoding is, or they lie about what their encoding 
is.  (The header may say it's ISO8859, or UTF8, or whatever, but someone 
copy-pasted from MS Word into a text box and now it's Windows-1252 within a 
wrapper that says ISO8859 but is mostly UTF8 except for the Windows-1252 part.  
Like, that's literally the problem I lost several days to.)  "Your own fault" 
is not even an accurate answer at that point.

So if we're going to take away people's broken hammer, we need to be very clear 
about what hammer to use instead.

The initial answer is probably documentation along the lines of "here's how to 
combine a few mbstring functions into a reasonably good 
guess-my-encoding-and-convert-to-utf8 routine."  Which... may exist, but if it 
does I've never found it.  So at bare minimum the utf8_encode() documentation 
needs to include a "use this code snippet instead" description, and not just 
link to the mbstring extension.  Glancing through the 
mbstring docs right now, it looks like it's not already a single function call, 
but some combination of several, and has some global flags that get set (via 
mb_detect_order()), I think.  It's not as easy to use as utf8_encode(), even if 
utf8_encode() is wrong.  That suggests we may want to try and simplify the 
mbstring API, or internalize some function that handles the most common case in 
a way that doesn't rely on global flags.
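
As a rough idea of what "the most common case without global flags" could look 
like, here is a sketch assuming mbstring is loaded.  The function name 
to_utf8_best_effort() and the Windows-1252 default are assumptions made for 
illustration, not an existing or proposed API.  Both encodings are passed 
explicitly, so nothing depends on mb_detect_order() or mb_internal_encoding().

```php
<?php
// Hypothetical helper, not part of PHP or mbstring.
function to_utf8_best_effort(string $bytes, string $fallback = 'Windows-1252'): string
{
    // Valid UTF-8 (which includes plain ASCII) is returned untouched,
    // so already-correct input is never double-encoded.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes;
    }

    // Otherwise assume a single-byte legacy encoding. This is a guess:
    // the bytes themselves cannot say which legacy encoding they are in.
    return mb_convert_encoding($bytes, 'UTF-8', $fallback);
}

echo bin2hex(to_utf8_best_effort("caf\xC3\xA9")), "\n"; // already UTF-8, unchanged
echo bin2hex(to_utf8_best_effort("caf\xE9")), "\n";     // legacy "café", converted
echo bin2hex(to_utf8_best_effort("\x80")), "\n";        // Windows-1252 Euro sign
```

This is only a heuristic.  It does not untangle a string that mixes UTF-8 with 
Windows-1252 fragments, and a short legacy string whose bytes happen to form 
valid UTF-8 will be passed through as-is.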

So, let's make that easier to use, so that we can change "this function is 
wrong, we're taking it away from you" to "this function is wrong, here's a way 
better alternative that you can use instead (while we quietly take the wrong 
one away from you while you're distracted by the new shiny)."

I don't know the mbstring API well enough to say what that alternative ideally 
looks like, but if we can answer that it would make killing off the old 
functions much more palatable.

--Larry Garfield

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php
