Re: [PHP-DEV] Deprecate declare(encoding='...') + zend.multibyte + zend.script_encoding + zend.detect_unicode ?

Claude Pache Wed, 29 Nov 2023 12:14:39 -0800


> On 28 Nov 2023, at 21:47, Hans Henrik Bergan <divinit...@gmail.com> wrote:
> 
>> What is the migration path for legacy code that uses those directives?
> 
> The migration path is to convert the legacy-encoding PHP files to UTF-8.
> Luckily this can be largely automated, here is my attempt:
> https://github.com/divinity76/php2utf8/blob/main/src/php2utf8.php
> but that code definitely needs some proof-reading and additions - idk
> if the approach used is even a good approach, it was just the first i
> could think of, feel free to write one from scratch


Hi,

Converting the character encoding of PHP files is by no means sufficient, 
except in the simplest cases.

Strings of text are to be found in various places, such as:

1. in the PHP files, as literals;
2. inside memory, at runtime;
3. in non-PHP data files stored on the server;
4. in the database;
5. as presented to the user (e.g. html document) and as received from them 
(e.g. form submission);
6. etc.

If you change the character encoding in (1), you necessarily change the 
encoding in (2), unless you wrap your literals with some function that performs 
the conversion in the other direction at runtime. And if you change the 
encoding in (2), you should be very careful when your text flows from and to 
(3), (4), (5) and (6): you should either change the encoding at those places, 
or make sure that proper conversion is done at the boundaries of those domains.
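
To illustrate conversion at a boundary, here is a minimal sketch. It assumes 
the mbstring extension is available, that the in-memory encoding is UTF-8 and 
that the data store still holds Windows-1252 bytes. The helper names 
fromLegacy() and toLegacy() are hypothetical, not part of any library.

```php
<?php
declare(strict_types=1);

// Assumptions for illustration: UTF-8 in memory, Windows-1252 in the legacy store.
const INTERNAL_ENCODING = 'UTF-8';
const LEGACY_ENCODING = 'Windows-1252';

/** Convert bytes read from a legacy store (file, database, form) to UTF-8. */
function fromLegacy(string $bytes): string
{
    return mb_convert_encoding($bytes, INTERNAL_ENCODING, LEGACY_ENCODING);
}

/** Convert in-memory UTF-8 text back before writing it to a legacy store. */
function toLegacy(string $text): string
{
    return mb_convert_encoding($text, LEGACY_ENCODING, INTERNAL_ENCODING);
}

$stored = "caf\xE9";              // "café" as stored in Windows-1252
$inMemory = fromLegacy($stored);  // now "caf\xC3\xA9" (UTF-8)

var_dump($inMemory === "caf\xC3\xA9");      // bool(true)
var_dump(toLegacy($inMemory) === $stored);  // bool(true)
```

The point of funnelling every read and write through two such functions is 
that a boundary which has not been migrated yet stays visible in the code.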

Also, mechanical conversion is not the whole story. For example, if you change 
the encoding in (5), you should not forget to adapt the <meta charset> tag 
and/or the Content-Type HTTP header.
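
As a rough sketch of what has to stay in sync on the output side: the charset 
declared in the HTTP header, the one in the <meta charset> tag, and the one 
passed to htmlspecialchars() should all match the encoding of the bytes 
actually sent. The helper names and the page content are assumptions made for 
illustration.

```php
<?php
declare(strict_types=1);

// Hypothetical helpers; a single constant keeps the three declarations aligned.
const PAGE_CHARSET = 'UTF-8';

function contentTypeHeader(string $charset = PAGE_CHARSET): string
{
    return 'Content-Type: text/html; charset=' . $charset;
}

function renderPage(string $title, string $charset = PAGE_CHARSET): string
{
    // Escape with the same charset that is declared to the browser.
    $safeTitle = htmlspecialchars($title, ENT_QUOTES, $charset);

    return "<!doctype html>\n"
        . "<html><head><meta charset=\"$charset\">"
        . "<title>$safeTitle</title></head><body></body></html>\n";
}

// Send the header before any output when running under a web SAPI.
if (PHP_SAPI !== 'cli' && !headers_sent()) {
    header(contentTypeHeader());
}

echo renderPage("Caf\u{E9} & th\u{E9}");
```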

Also, not all strings are text, and only a human can decide whether the literal 
“\xe9” in a random location is meant to encode the raw byte 0xE9 or the 
character “é” in Latin-1.
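
A small sketch of why this cannot be decided mechanically, assuming the 
mbstring extension is available: the same one-byte literal is fine as binary 
data, invalid as UTF-8 text, and becomes two bytes if it is treated as Latin-1 
text and converted.

```php
<?php
declare(strict_types=1);

$literal = "\xE9";

// As raw data it is simply one byte.
var_dump(strlen($literal));                      // int(1)

// As UTF-8 text it is invalid: a lone 0xE9 is a truncated multi-byte sequence.
var_dump(mb_check_encoding($literal, 'UTF-8'));  // bool(false)

// If a human decides it meant the character "é" in Latin-1, it must be converted.
$asText = mb_convert_encoding($literal, 'UTF-8', 'ISO-8859-1');
var_dump(bin2hex($asText));                      // string(4) "c3a9"

// If it meant the raw byte 0xE9 (say, part of a binary protocol), converting
// it would silently corrupt the data, so it has to be left untouched.
var_dump(bin2hex($literal));                     // string(2) "e9"
```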

Of course, because we live in an interesting world, there will be situations 
where the encoding is unknown or ambiguous. Yuya mentioned the case of 
Shift-JIS, which has various incompatible variants. I am happy not to have 
encountered such ambiguities (only unknown encodings) when I converted our code 
base from windows-1252 (often loosely called latin-1) to UTF-8 a few years ago.
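
When the encoding of some input is unknown, a first triage can at least 
separate the harmless cases from those needing a human decision. The sketch 
below is only a heuristic, and classifyBytes() is a hypothetical name: bytes 
that happen to be valid UTF-8 could in principle still be legacy text, and 
nothing here can tell one 8-bit encoding from another.

```php
<?php
declare(strict_types=1);

/**
 * Rough triage of a byte string (hypothetical helper, heuristic only):
 *  - 'ascii'        : no byte >= 0x80, identical in ASCII-compatible encodings
 *  - 'utf-8'        : contains high bytes and is well-formed UTF-8
 *  - 'unknown-8bit' : contains high bytes and is not UTF-8; needs a human
 */
function classifyBytes(string $bytes): string
{
    if (preg_match('/[\x80-\xFF]/', $bytes) !== 1) {
        return 'ascii';
    }
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return 'utf-8';
    }
    return 'unknown-8bit';
}

var_dump(classifyBytes("cafe"));          // string(5) "ascii"
var_dump(classifyBytes("caf\xC3\xA9"));   // string(5) "utf-8"
var_dump(classifyBytes("caf\xE9"));       // string(12) "unknown-8bit"
```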

—Claude
