> -----Original Message-----
> From: Tomas Kuliavas [mailto:[EMAIL PROTECTED]
> Sent: 21 May 2007 19:26
> To: Andrei Zmievski
> Cc: internals@lists.php.net
> Subject: Re: [PHP-DEV] PHP Unicode extension in PHP6
>
> >> 0xC4 and 0x85 are hex codes for latin small letter a with ogonek in
> >> utf-8. ą
> >>
> >> <?php
> >> var_dump("ą" == "\xC4\x85");
> >> echo "ą\n";
> >> echo "\xC4\x85";
> >> ?>
> >>
> >> If script is written in utf-8, I expect bool(true) on var_dump() line.
> >
> > var_dump("ą" == b"\xC4\x85");
> >
> > This will give you what you want, if the script is written in UTF-8
> > and your runtime encoding is set to UTF-8.
> >
> >> <?php
> >> // example uses utf-8. similar code is used in iso-8859-2 -
> >> // iso-8859-16 decoding. utf-8 decoding does not need mapping tables
> >> // and is written in pcre.
> >> $s1 = "ą";
> >> $s2 = "\xC4\x85";
> >> echo str_replace($s2,'ą',$s1);
> >> ?>
> >>
> >> Expected result: ą
> >> Got: ą
> >>
> >> test setup (php6.0-200705190630) uses trimmed php.ini with only
> >> unicode.semantics=on setting
> >>
> >> unicode.fallback_encoding - no value
> >> unicode.filesystem_encoding - no value
> >> unicode.http_input_encoding - no value
> >> unicode.output_encoding - no value
> >> unicode.runtime_encoding - no value
> >> unicode.script_encoding - no value
> >> unicode.semantics - On
> >> unicode.stream_encoding - UTF-8
> >
> > Why didn't you set any encoding settings?
>
> They are not documented and I am testing configurations that
> might break scripts. If I test things and want to make code
> portable, configuration is not supposed to be rational. I can
> set option with ini_set(), if I understand what option does
> and it fixes the issue.
>
> http://www.php.net/unicode
>
> Do you have updated documentation version which explains
> encoding settings and lists available configuration values?
> Or am I testing PHP6 too early and you are still months or
> years away from 6.0.0 betas and rcs?
>
> Could you implement pseudo encoding similar to 'pass' encoding
> used in mbstring? Current implementation does not give controls
> needed by script writers.
>
> SquirrelMail scripts are not written in unicode. They are in
> ascii. If some 8bit value is used, it is always written in
> octal or hex notation.
> These hex values are not written in one character set. In
> some cases scripts use byte values. For example, locating
> first utf-8 byte or looking for 0x80-0xFF bytes in string. In
> other cases they are written in source or target character
> set. For example, iso-8859-2 decoding function contains array
> with iso-8859-2 hex values mapped to html codes. Code can't
> use raw 8bit strings, because they might be corrupted in
> misconfigured editor used by developer and it is very hard to
> track such corruption.
> 8bit data can come only from user input (composed emails and
> preferences, html forms, one common charset) and imap server
> (received emails, lots of different charsets and encodings).
>
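
For the byte-level checks described above (looking for 0x80-0xFF bytes,
locating the first utf-8 byte), here is a rough sketch of what I assume
such code looks like when strings are treated as plain bytes. The function
names are made up for illustration and are not taken from SquirrelMail.
Reading "first utf-8 byte" as "first lead byte of a multibyte sequence"
(0xC2-0xF4) is my assumption as well. The sketch relies on byte-string
semantics, so it says nothing about how the same code behaves with
unicode.semantics=on.

```php
<?php
// Hypothetical helpers for illustration only; not SquirrelMail code.

// Returns true if $s contains at least one byte in the range 0x80-0xFF.
function has_8bit_bytes($s)
{
    // No /u modifier, so PCRE matches individual bytes rather than
    // UTF-8 characters.
    return preg_match('/[\x80-\xFF]/', $s) === 1;
}

// Returns the byte offset of the first UTF-8 lead byte (0xC2-0xF4),
// or -1 if the string contains none.
function first_utf8_lead_byte($s)
{
    if (preg_match('/[\xC2-\xF4]/', $s, $m, PREG_OFFSET_CAPTURE)) {
        return $m[0][1];
    }
    return -1;
}

// 8bit values are written in hex notation, never as raw bytes in the source.
$s = b"a\xC4\x85";

var_dump(has_8bit_bytes($s));
var_dump(has_8bit_bytes("abc"));
var_dump(first_utf8_lead_byte($s));
```

Writing the test string with the b prefix and hex escapes keeps the source
file pure ascii, which matches the constraint that raw 8bit strings must
not appear in the scripts.
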
Recent versions of PHP 5 have a binary string prefix (b"..."):

echo strlen(b"\xC4\x85");

Jared

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php