Re: [PHP-DEV] What should we do with utf8_encode and utf8_decode?

Rowan Tommins Mon, 22 Mar 2021 03:24:12 -0700

On 22/03/2021 01:15, Sara Golemon wrote:

My preference is for a deprecation notice (but not necessarily removal ever -- We can argue that part a little).

I'm strongly against any concept of "indefinite deprecation". I consider any deprecation notice a commitment to remove the feature in the future, even if a specific timeline for that removal is not given.

If we want to have a separate status of "will be kept indefinitely, but you shouldn't use it", then we need a separate E_DISCOURAGED, or some boilerplate in the manual which doesn't use the word "deprecated".

As for details, I don't love iso_8859_1_to_utf8(), but we can use the common alias for iso-8859-1 known as latin1 and call the new functions: utf8_from_latin1() and utf8_to_latin1() with the caveat that the latter will throw a ValueError for codepoints which are out of range (one of the more problematic issues with utf8_decode()). That makes this not just a simple rename for clarity, but what I'd consider a bug-fix for an unfortunately unfixable function.

While I can see the temptation here, I'm not sure who the target audience for the new function would be:

* People who just want to replace calls to utf8_decode won't want to go through every call and make it exception safe.
* People who want to write a polyfill couldn't use it, because they wouldn't be able to recover the remainder of the string after an error is thrown.
* People who want transcoding without any optional extensions will be disappointed to find only this one encoding supported.
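To illustrate the polyfill problem in the second point, here is a minimal userland sketch of a throwing decoder. The name utf8_to_latin1_strict() is hypothetical, and the code only assumes the behaviour described above (a ValueError for anything outside Latin1); it is not the proposed implementation.

```php
<?php
declare(strict_types=1);

// Hypothetical sketch: not an existing or proposed core function.
function utf8_to_latin1_strict(string $utf8): string
{
    $out = '';
    $len = strlen($utf8);
    for ($i = 0; $i < $len; $i++) {
        $byte = ord($utf8[$i]);
        if ($byte < 0x80) {
            // ASCII is identical in UTF-8 and Latin1
            $out .= $utf8[$i];
            continue;
        }
        // U+0080..U+00FF are encoded as C2 or C3 plus one continuation byte
        if (($byte === 0xC2 || $byte === 0xC3) && $i + 1 < $len) {
            $next = ord($utf8[$i + 1]);
            if (($next & 0xC0) === 0x80) {
                $out .= chr((($byte & 0x03) << 6) | ($next & 0x3F));
                $i++;
                continue;
            }
        }
        // Everything else is invalid UTF-8 or a code point above U+00FF
        throw new ValueError("Cannot represent input at byte offset $i in Latin1");
    }
    return $out;
}

// The euro sign (U+20AC) is outside Latin1, so the whole call fails.
$input = "caf\u{E9} costs \u{20AC}5 today";
try {
    $latin1 = utf8_to_latin1_strict($input);
} catch (ValueError $e) {
    // The partial result ("café costs ") was a local variable of the
    // function, so the caller has nothing left to continue from.
    $latin1 = null;
}
```

A polyfill built on top of such a function has no way to substitute a replacement character and carry on with the rest of the string; it would have to reimplement the decoding loop itself.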

You'd effectively be adding a completely new core function just for those people who work with Latin1 text, and are confident that it's not Windows-1252 in disguise.

It's tempting to make any C1 control characters an error as well - although technically valid in Latin1, these are very rarely used, and it's much more likely that any bytes in that range are intended as characters in Windows-1252. But that would feel very odd without having a corresponding utf8_from_windows1252 function to use instead, at which point we're into designing a whole new conversion library. And of course, once you've got that UTF-8 string, you can't do much with it, because PHP's native string functions are all byte-based, so you've basically got to re-invent large chunks of ext/mbstring...
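As a rough sketch of the "Windows-1252 in disguise" problem: byte 0x80 is the C1 control character U+0080 in Latin1, but the euro sign in Windows-1252. Both helper names below are hypothetical, written in plain PHP without any optional extension, purely for illustration.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: strict ISO-8859-1 to UTF-8, where every byte maps
// to the code point with the same numeric value.
function latin1_to_utf8_sketch(string $latin1): string
{
    $out = '';
    foreach (str_split($latin1) as $char) {
        $byte = ord($char);
        $out .= $byte < 0x80
            ? $char
            : chr(0xC0 | ($byte >> 6)) . chr(0x80 | ($byte & 0x3F));
    }
    return $out;
}

// Hypothetical helper: true if the input contains bytes in 0x80..0x9F,
// which are C1 controls in Latin1 but printable characters in Windows-1252.
function has_c1_bytes(string $bytes): bool
{
    return preg_match('/[\x80-\x9F]/', $bytes) === 1;
}

$bytes = "\x80" . "5";                       // "€5" if the source was Windows-1252
$asLatin1 = latin1_to_utf8_sketch($bytes);   // starts with C2 80, i.e. U+0080
$euroUtf8 = "\u{20AC}5";                     // what the author probably meant
$suspicious = has_c1_bytes($bytes);          // flags the ambiguous byte
```

Rejecting such input tells the caller something is wrong, but offers no way to get the euro sign out without a Windows-1252 counterpart.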



Regards,

--
Rowan Tommins
[IMSoP]

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php
