Hi,

IMHO, #42396 is not a bug; it is the specified behavior.
A normal script doesn't contain null bytes unless it is encoded in Unicode.

I understand the motivation for detecting a unique byte sequence
(four 0xFF bytes) to support PHAR/PHK, but that is a change that adds
a new feature, not a bug fix.

Rui

On Thu, 23 Aug 2007 18:58:52 +0200
LAUPRETRE François (P) <[EMAIL PROTECTED]> wrote:

> Hi,
>
> Here is a patch I am submitting to fix bug #42396 (PHP 5).
>
> The problem: when PHP is configured with the '--enable-zend-multibyte'
> option, it tries to autodetect unicode-encoded scripts. Then, if a script
> contains null bytes after an __halt_compiler() directive, it will be
> considered as UTF-16 or 32, and the execution typically results in a lot of
> '?' garbage. In practice, it makes PHK and PHAR incompatible with the
> zend-multibyte feature.
>
> The only workaround was to turn off the (undocumented) 'detect_unicode' flag.
> But it is not a real solution, as people may want to use unicode detection
> along with PHK/PHAR packages, and there's no logical reason to keep them
> incompatible.
>
> The patch I am submitting assumes that a document encoded in UTF-8, UTF-16,
> or UTF-32 cannot contain a sequence of four 0xff bytes. So, it adds a small
> detection loop before scanning the script for null bytes. If a sequence of 4
> 0xff is found, the unicode detection is aborted and the script is considered
> as non unicode, whatever other binary data it can contain. Of course, this
> detection happens after looking for a byte-order mark.
>
> Now, I can modify the PHK_Creator tool to set 4 0xff bytes after the
> __halt_compiler() directive, which makes the generated PHK archives
> compatible with zend-multibyte. The same for PHAR.
>
> It would be better if we could scan the script for null bytes only up to the
> __halt_compiler() directive, but I suspect it to be impossible as it is not
> yet compiled...
>
> Regards
>
> Francois
>
> --- zend_multibyte.c.old 2007-01-01 10:35:46.000000000 +0100
> +++ zend_multibyte.c 2007-08-23 17:22:24.000000000 +0200
> @@ -1035,6 +1035,7 @@
>      zend_encoding *script_encoding = NULL;
>      int bom_size;
>      char *script;
> +    unsigned char *p,*p_end;
>
>      if (LANG_SCNG(script_org_size) < sizeof(BOM_UTF32_LE)-1) {
>          return NULL;
> @@ -1069,6 +1070,18 @@
>          return script_encoding;
>      }
>
> +    /* Search for four 0xff bytes - if found, script cannot be unicode */
> +
> +    p=(unsigned char *)LANG_SCNG(script_org);
> +    p_end=(p+LANG_SCNG(script_org_size)-3);
> +    while (p < p_end) {
> +        if (   ((* p)   ==(unsigned char)0x0ff)
> +            && ((*(p+1))==(unsigned char)0x0ff)
> +            && ((*(p+2))==(unsigned char)0x0ff)
> +            && ((*(p+3))==(unsigned char)0x0ff)) return NULL;
> +        p++;
> +    }
> +
>      /* script contains NULL bytes -> auto-detection */
>      if (memchr(LANG_SCNG(script_org), 0, LANG_SCNG(script_org_size))) {
>          /* make best effort if BOM is missing */
>

-- 
Rui Hirokawa <[EMAIL PROTECTED]>

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
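
For readers following the thread, the check proposed in the quoted patch can
be sketched as a standalone C function. This is only an illustration under
assumptions: has_ff_marker() and would_autodetect() are hypothetical names
that do not exist in zend_multibyte.c, and the real code operates on
LANG_SCNG(script_org) after the byte-order-mark handling, not on a plain
buffer as shown here.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of zend_multibyte.c): returns 1 if buf
 * contains four consecutive 0xff bytes, 0 otherwise. */
static int has_ff_marker(const unsigned char *buf, size_t len)
{
    static const unsigned char marker[4] = { 0xff, 0xff, 0xff, 0xff };
    size_t i;

    if (len < sizeof(marker)) {
        return 0; /* too short to hold the marker */
    }
    for (i = 0; i + sizeof(marker) <= len; i++) {
        if (memcmp(buf + i, marker, sizeof(marker)) == 0) {
            return 1;
        }
    }
    return 0;
}

/* Hypothetical helper: returns 1 if the null-byte based auto-detection
 * would be attempted for this script, 0 if it would be skipped.
 * BOM handling, which comes first in the real code, is omitted here. */
static int would_autodetect(const unsigned char *buf, size_t len)
{
    if (has_ff_marker(buf, len)) {
        return 0; /* marker found: treat the script as non-Unicode */
    }
    /* a null byte anywhere in the script triggers auto-detection */
    return memchr(buf, 0, len) != NULL;
}
```

With this sketch, a script carrying binary data with null bytes after
__halt_compiler() would still go through auto-detection, while the same
script with four 0xff bytes placed anywhere in it would not.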