Hi,

Here is a patch I am submitting to fix bug #42396 (PHP 5).

The problem: when PHP is configured with the '--enable-zend-multibyte' option,
it tries to autodetect Unicode-encoded scripts. If a script contains null
bytes after an __halt_compiler() directive, it is detected as UTF-16 or
UTF-32, and execution typically produces a lot of '?' garbage. In practice,
this makes PHK and PHAR incompatible with the zend-multibyte feature.

Until now, the only workaround was to turn off the (undocumented)
'detect_unicode' flag. That is not a real solution, as people may want to use
Unicode detection along with PHK/PHAR packages, and there is no logical reason
to keep them incompatible.

The patch assumes that a document encoded in UTF-8, UTF-16, or UTF-32 cannot
contain a sequence of four 0xff bytes: 0xff never occurs in UTF-8, every
UTF-32 code unit has a zero high-order byte, and in UTF-16 such a sequence
would require the noncharacter U+FFFF. The patch therefore adds a small
detection loop before the script is scanned for null bytes. If a sequence of
four 0xff bytes is found, Unicode detection is aborted and the script is
treated as non-Unicode, whatever other binary data it may contain. Of course,
this detection happens after the search for a byte-order mark.
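
For readers who want to try the check outside the engine, here is a minimal
standalone sketch of the same idea. The name has_ff_marker() is hypothetical
and does not appear in the patch. Unlike the loop in the diff below, it counts
a run of 0xff bytes instead of comparing four bytes at each position, so it is
also safe on buffers shorter than four bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the patch: returns 1 if buf contains
 * four consecutive 0xff bytes, 0 otherwise. */
static int has_ff_marker(const unsigned char *buf, size_t len)
{
    size_t run = 0; /* length of the current run of 0xff bytes */
    size_t i;

    assert(buf != NULL || len == 0);

    for (i = 0; i < len; i++) {
        /* extend the run on 0xff, reset it on any other byte */
        run = (buf[i] == 0xff) ? run + 1 : 0;
        if (run == 4) {
            return 1;
        }
    }
    return 0;
}
```

A caller would treat a return value of 1 as "not Unicode" and skip the
null-byte scan, mirroring the early "return NULL" in the patch.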

With this in place, I can modify the PHK_Creator tool to write four 0xff bytes
after the __halt_compiler() directive, which makes the generated PHK archives
compatible with zend-multibyte. The same can be done for PHAR.
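
To make the intended archive layout concrete, here is a rough sketch in C. It
is only an illustration: PHK_Creator is not written this way, and
build_archive() and FF_MARKER_LEN are hypothetical names. It assumes the
layout is simply the PHP stub, then the four marker bytes, then the binary
payload.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define FF_MARKER_LEN 4 /* number of 0xff bytes the patch looks for */

/* Hypothetical layout helper: writes the stub text, then four 0xff bytes,
 * then the binary payload into out. Returns the number of bytes written,
 * or 0 if out is too small. */
static size_t build_archive(unsigned char *out, size_t out_cap,
                            const char *stub,
                            const unsigned char *payload, size_t payload_len)
{
    size_t stub_len;
    size_t total;

    assert(out != NULL && stub != NULL);

    stub_len = strlen(stub);
    total = stub_len + FF_MARKER_LEN + payload_len;
    if (total > out_cap) {
        return 0;
    }

    memcpy(out, stub, stub_len);
    memset(out + stub_len, 0xff, FF_MARKER_LEN); /* the marker */
    if (payload_len > 0) {
        memcpy(out + stub_len + FF_MARKER_LEN, payload, payload_len);
    }
    return total;
}
```

For example, calling it with a stub such as "<?php __halt_compiler();" (an
assumed stub, for illustration only) and a payload containing null bytes
gives a buffer in which the marker directly follows the directive.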

It would be better to scan the script for null bytes only up to the
__halt_compiler() directive, but I suspect this is impossible, as the script
is not yet compiled at that point...

Regards

Francois

--- zend_multibyte.c.old        2007-01-01 10:35:46.000000000 +0100
+++ zend_multibyte.c    2007-08-23 17:22:24.000000000 +0200
@@ -1035,6 +1035,7 @@
        zend_encoding *script_encoding = NULL;
        int bom_size;
        char *script;
+       unsigned char *p,*p_end;
 
        if (LANG_SCNG(script_org_size) < sizeof(BOM_UTF32_LE)-1) {
                return NULL;
@@ -1069,6 +1070,18 @@
                return script_encoding;
        }
 
+       /* Search for four 0xff bytes - if found, script cannot be unicode */
+
+       p=(unsigned char *)LANG_SCNG(script_org);
+       p_end=(p+LANG_SCNG(script_org_size)-3);
+       while (p < p_end) {
+               if (   ((* p)   ==(unsigned char)0x0ff)
+                       && ((*(p+1))==(unsigned char)0x0ff)
+                       && ((*(p+2))==(unsigned char)0x0ff)
+                       && ((*(p+3))==(unsigned char)0x0ff)) return NULL;
+               p++;
+       }
+
        /* script contains NULL bytes -> auto-detection */
        if (memchr(LANG_SCNG(script_org), 0, LANG_SCNG(script_org_size))) {
                /* make best effort if BOM is missing */

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
