Re: [PHP-DEV] What should we do with utf8_encode and utf8_decode?

Ayesh Karunaratne Sun, 21 Mar 2021 08:01:52 -0700

Thank you for opening this conversation. These functions have stung me
in the past, and I would be so happy to see them gone :)


Personally, I would very much like to go with Plan A.

- XML parsers, which often deal with non-UTF-8 character encodings,
frequently use these functions. However, any parser worth its salt
is better off using mbstring or iconv, because these functions lack the
Windows-1252 support that is often assumed elsewhere for ISO-8859-1. If we
had a `utf8_encode` that supported Windows-1252, as is often expected, I
think Plan B would be the smoother upgrade.

- Among the Packagist top 1000 packages by downloads, stripe-php, phpcpd,
pdepend, carbon, monolog, php-cs-fixer, htmlpurifier, and aws-php-sdk use
`utf8_encode`. Some of these libraries depend on `ext-mbstring` or the
Symfony mbstring polyfill, so we are left with even fewer libraries
that cannot assume `iconv()` or `mb_convert_encoding()` availability.
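
As a rough sketch of the mbstring route (assuming `ext-mbstring` is
loaded and its substitute character is left at the default; the variable
names are only illustrative), the Euro sign case Rowan describes below
can be compared across the three encodings like this:

```php
<?php
// Sketch only: assumes ext-mbstring with the default substitute character.
$euro = "\u{20AC}"; // the Euro sign (U+20AC) as a UTF-8 string

// ISO 8859-1 has no Euro sign, so mbstring substitutes "?" (byte 0x3F).
$latin1 = mb_convert_encoding($euro, 'ISO-8859-1', 'UTF-8');

// Windows-1252 maps the Euro sign to byte 0x80.
$cp1252 = mb_convert_encoding($euro, 'Windows-1252', 'UTF-8');

// ISO 8859-15 maps the Euro sign to byte 0xA4.
$latin9 = mb_convert_encoding($euro, 'ISO-8859-15', 'UTF-8');

// Print the resulting bytes as hex for comparison.
printf("%s %s %s\n", bin2hex($latin1), bin2hex($cp1252), bin2hex($latin9));
```

Naming the target encoding explicitly is the point here: the caller has
to say which single-byte encoding they actually mean.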

On Sun, Mar 21, 2021 at 7:48 PM Rowan Tommins <rowan.coll...@gmail.com> wrote:
>
> Hi all,
>
> The functions utf8_encode and utf8_decode are historical oddities, which
> almost certainly would not be accepted if proposed today:
>
> * Their names do not describe their functionality, which is to convert
> to/from one specific single-byte encoding. This leads to a common
> confusion that they can be used to "fix" UTF-8 encoding problems, which
> they generally make worse.
> * That single-byte encoding is ISO 8859-1, not its common cousins
> Windows-1252 or ISO 8859-15. This means, for instance, that they do not
> handle the Euro sign: utf8_decode('€') returns '?' (i.e. unmappable)
> not "\x80" (Windows-1252) or "\xA4" (8859-15)
>
> On the other hand, they are commonly used, both correctly and
> incorrectly, so removing them is not easy.
>
> A previous proposal to remove them [1] resulted in Andrea making two
> significant improvements: moving them from ext/xml to ext/standard [2]
> and rewriting the documentation to explain them properly [3]. My genuine
> thanks for that.
>
> However, it hasn't stopped people misunderstanding them, and quite
> reasonably: you shouldn't need to look up every function you use in the
> manual, to make sure it actually does what its name suggests.
>
>
> I can see three ways forward:
>
> A) Raise a deprecation notice in 8.1, and remove in 9.0. Do not provide
> a specific replacement, but recommend people look at iconv() or
> mb_convert_encoding(). There is precedent for this, such as
> convert_cyr_string(), but it may frustrate those who are using the
> functions correctly.
>
> B) Introduce new names, such as utf8_to_iso_8859_1 and
> iso_8859_1_to_utf8; immediately make those the primary names in the
> manual, with utf8_encode / utf8_decode as aliases. Raise deprecation
> notices for the old names, either immediately or in some future release.
> This gives a smoother upgrade path, but commits us to having these
> functions as outliers in our standard library.
>
> C) Leave them alone forever. Treat it as the user's fault if they mess
> things up by misunderstanding them.
>
>
> I am happy to put together an RFC for either A or B, if it has a chance
> of reaching consensus. I would really like to avoid option C.
>
>
> [1] https://externals.io/message/95166
> [2] https://github.com/php/php-src/pull/2160
> [3] https://github.com/php/doc-en/commit/838941f6cce51f3beda16012eb497b26295a8238
>
> Regards,
>
> --
> Rowan Tommins
> [IMSoP]
>
> --
> PHP Internals - PHP Runtime Development Mailing List
> To unsubscribe, visit: https://www.php.net/unsub.php
>

