RE: [PHP-DEV] PHP Unicode extension in PHP6

Jared Williams Tue, 22 May 2007 07:41:03 -0700

 

> -----Original Message-----
> From: Tomas Kuliavas [mailto:[EMAIL PROTECTED] 
> Sent: 21 May 2007 19:26
> To: Andrei Zmievski
> Cc: internals@lists.php.net
> Subject: Re: [PHP-DEV] PHP Unicode extension in PHP6
> 
> >> 0xC4 and 0x85 are hex codes for latin small letter a with ogonek in
> >> utf-8. ą
> >>
> >> <?php
> >> var_dump("ą" == "\xC4\x85");
> >> echo "ą\n";
> >> echo "\xC4\x85";
> >> ?>
> >>
> >> If script is written in utf-8, I expect bool(true) on var_dump() line.
> >
> > var_dump("ą" == b"\xC4\x85");
> >
> > This will give you what you want, if the script is written in UTF-8 
> > and your runtime encoding is set to UTF-8.
> >
> >> <?php
> >> // example uses utf-8. similar code is used in iso-8859-2 -
> >> // iso-8859-16 decoding. utf-8 decoding does not need mapping tables
> >> // and is written in pcre.
> >> $s1 = "ą";
> >> $s2 = "\xC4\x85";
> >> echo str_replace($s2,'&#261;',$s1);
> >> ?>
> >>
> >> Expected result: &#261;
> >> Got: ą
> >>
> >> test setup (php6.0-200705190630) uses trimmed php.ini with only 
> >> unicode.semantics=on setting
> >>
> >> unicode.fallback_encoding - no value
> >> unicode.filesystem_encoding - no value
> >> unicode.http_input_encoding - no value
> >> unicode.output_encoding - no value
> >> unicode.runtime_encoding - no value
> >> unicode.script_encoding - no value
> >> unicode.semantics - On
> >> unicode.stream_encoding - UTF-8
> >
> > Why didn't you set any encoding settings?
> 
> They are not documented and I am testing configurations that 
> might break scripts. If I test things and want to make code 
> portable, configuration is not supposed to be rational. I can 
> set option with ini_set(), if I understand what option does 
> and it fixes the issue.
> 
> http://www.php.net/unicode
> 
> Do you have updated documentation version which explains 
> encoding settings and lists available configuration values? 
> Or am I testing PHP6 too early and you are still months or 
> years away from 6.0.0 betas and rcs? Could you implement 
> pseudo encoding similar to 'pass' encoding used in mbstring?
> Current implementation does not give controls needed by 
> script writers.
> 
> SquirrelMail scripts are not written in unicode. They are in 
> ascii. If some 8bit value is used, it is always written in 
> octal or hex notation.
> These hex values are not written in one character set. In 
> some cases scripts use byte values. For example, locating 
> first utf-8 byte or looking for 0x80-0xFF bytes in string. In 
> other cases they are written in source or target character 
> set. For example, iso-8859-2 decoding function contains array 
> with iso-8859-2 hex values mapped to html codes. Code can't 
> use raw 8bit strings, because they might be corrupted in 
> misconfigured editor used by developer and it is very hard to 
> track such corruption.
> 8bit data can come only from user input (composed emails and 
> preferences, html forms, one common charset) and imap server 
> (received emails, lots of different charsets and encodings).
>


Recent versions of PHP 5 have a binary string introducer, the b prefix on a string literal:

echo strlen(b"\xC4\x85");
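
A slightly longer sketch of the same idea, for anyone who wants to check that a b-prefixed literal is handled as raw bytes. The variable names are only for illustration, and I am assuming a PHP build that accepts the b prefix:

```php
<?php
// The b prefix marks the literal as a binary (byte) string.
// These are the two UTF-8 bytes of "latin small letter a with ogonek".
$bytes = b"\xC4\x85";

// The length is counted in bytes, so this prints 2.
echo strlen($bytes), "\n";

// Building the same value byte by byte gives an identical string.
$same = chr(0xC4) . chr(0x85);
var_dump($bytes === $same);
```

Where the prefix is accepted, scripts that keep their 8bit values in hex or octal notation can mark those literals as binary without changing how they are written.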

Jared

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
