Hi,

Here is a patch I am submitting to fix bug #42396 (PHP 5).
The problem: when PHP is configured with the '--enable-zend-multibyte' option, it tries to autodetect Unicode-encoded scripts. If a script contains null bytes after an __halt_compiler() directive, it is treated as UTF-16 or UTF-32, and execution typically produces a lot of '?' garbage. In practice, this makes PHK and PHAR incompatible with the zend-multibyte feature.

Until now, the only workaround has been to turn off the (undocumented) 'detect_unicode' flag. That is not a real solution: people may want to use Unicode detection along with PHK/PHAR packages, and there is no logical reason to keep them incompatible.

The patch I am submitting assumes that a document encoded in UTF-8, UTF-16, or UTF-32 cannot contain a sequence of four 0xff bytes. It therefore adds a small detection loop before the script is scanned for null bytes. If a sequence of four 0xff bytes is found, Unicode detection is aborted and the script is treated as non-Unicode, whatever other binary data it contains. Of course, this check happens after the search for a byte-order mark.

With this in place, I can modify the PHK_Creator tool to write four 0xff bytes after the __halt_compiler() directive, which makes the generated PHK archives compatible with zend-multibyte. The same can be done for PHAR.

It would be better to scan the script for null bytes only up to the __halt_compiler() directive, but I suspect that is impossible, since the script has not been compiled yet at that point.
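As a standalone illustration of the check order described above (this is not the patch itself), the logic could be sketched roughly as follows. The names scan_script, has_ff_marker, and the scan_result enum are hypothetical and do not exist in the Zend engine; BOM handling is left out, and the sketch assumes the whole script is available in one buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result codes for this sketch, not part of the Zend engine. */
enum scan_result {
	SCAN_NOT_UNICODE,   /* 0xff marker found, or no null bytes at all */
	SCAN_MAYBE_UNICODE  /* null bytes and no marker: UTF-16/32 heuristics would run */
};

/* Returns 1 if buf contains four consecutive 0xff bytes, 0 otherwise. */
static int has_ff_marker(const unsigned char *buf, size_t len)
{
	size_t i;

	assert(buf != NULL || len == 0);
	/* i + 4 <= len keeps buf[i + 3] inside the buffer */
	for (i = 0; i + 4 <= len; i++) {
		if (buf[i] == 0xff && buf[i + 1] == 0xff
			&& buf[i + 2] == 0xff && buf[i + 3] == 0xff) {
			return 1;
		}
	}
	return 0;
}

/* Marker check first, null-byte scan second (BOM lookup omitted). */
static enum scan_result scan_script(const unsigned char *buf, size_t len)
{
	if (has_ff_marker(buf, len)) {
		return SCAN_NOT_UNICODE;
	}
	if (len > 0 && memchr(buf, 0, len) != NULL) {
		return SCAN_MAYBE_UNICODE;
	}
	return SCAN_NOT_UNICODE;
}
```

In this sketch, a buffer holding null bytes but no marker yields SCAN_MAYBE_UNICODE, while the same buffer with four 0xff bytes anywhere in it yields SCAN_NOT_UNICODE, which is the behaviour an archive with a binary payload needs.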
Regards

Francois

--- zend_multibyte.c.old	2007-01-01 10:35:46.000000000 +0100
+++ zend_multibyte.c	2007-08-23 17:22:24.000000000 +0200
@@ -1035,6 +1035,7 @@
 	zend_encoding *script_encoding = NULL;
 	int bom_size;
 	char *script;
+	unsigned char *p,*p_end;
 
 	if (LANG_SCNG(script_org_size) < sizeof(BOM_UTF32_LE)-1) {
 		return NULL;
@@ -1069,6 +1070,18 @@
 		return script_encoding;
 	}
 
+	/* Search for four 0xff bytes - if found, script cannot be unicode */
+
+	p=(unsigned char *)LANG_SCNG(script_org);
+	p_end=(p+LANG_SCNG(script_org_size)-3);
+	while (p < p_end) {
+		if ( ((* p) ==(unsigned char)0x0ff)
+			&& ((*(p+1))==(unsigned char)0x0ff)
+			&& ((*(p+2))==(unsigned char)0x0ff)
+			&& ((*(p+3))==(unsigned char)0x0ff)) return NULL;
+		p++;
+	}
+
 	/* script contains NULL bytes -> auto-detection */
 	if (memchr(LANG_SCNG(script_org), 0, LANG_SCNG(script_org_size))) {
 		/* make best effort if BOM is missing */

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php