On 21/05/07, Tomas Kuliavas <[EMAIL PROTECTED]> wrote:
Latin capital letter A with diaeresis is 00C4. Not C4.
Pay attention in maths, leading zeroes don't change a number.
I wrote two 8bit values. Not two 16bit ones. The interpreter tries to outsmart me and thinks that I want 00C4 when I write C4.
No, you didn't do anything with bits. "" is a unicode string, and in unicode strings you are handling codeunits, not bytes. Codeunit 0xC4 is the same as codeunit 0x00C4 because it's the same number, and it's the codeunit pointing to a capital A with diaeresis.
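To make the point concrete, here is a small sketch in Python 3, used only as a stand-in because its text strings behave similarly. That PHP 6 behaves the same way is an assumption, not something tested here, and the variable names are made up for illustration.

```python
# Leading zeros do not change a number, so both spellings name the same code unit.
assert 0xC4 == 0x00C4

# In a Python 3 text string, \xC4 is the character U+00C4, not a raw byte.
short_escape = "\xC4"
long_escape = "\u00C4"
same_character = short_escape == long_escape

# Two \x escapes in a text string give two code units (U+00C4 and U+0085).
two_units = "\xC4\x85"
unit_count = len(two_units)

# The two 8-bit values C4 85 only exist as bytes in a byte string,
# and only decoding them as UTF-8 yields U+0105.
utf8_bytes = b"\xC4\x85"
decoded = utf8_bytes.decode("utf-8")
```

The text string and the byte string are different types here, so the choice between "two code units" and "two bytes" is made by the string prefix, not by an encoding setting.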
http://www.php.net/language.types.string --- \x[0-9A-Fa-f]{1,2} - the sequence of characters matching the regular expression is a character in hexadecimal notation --- One or two alphanumerics after x. This escape is used to write 8bit values. You can't write 16 bit Unicode characters with one escape.
You are quoting the php5 documentation; you can't expect it to reflect code that isn't even alpha. What you quote is true in php6 for binary strings (b prefix) when you read "character" in the C sense. (When you read "character" as "codeunit" it's true for php6 too, but you shouldn't use the word "character" that much, as a "character" is a pretty misleading concept: do you mean a codeunit, a codepoint, a grapheme, a glyph?) And you CAN write a 16 bit codeunit in one escape, like "\u0105". Note that codeunit != codepoint: var_dump(strlen("\uD801\uDC00")); gives int(1) because the two escapes are a surrogate pair that together encode a single codepoint (U+10400).
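The surrogate arithmetic can be checked with a short Python 3 sketch. The helper names combine_surrogates and utf16_code_units are hypothetical, and Python is again only an analogue: its len() counts codepoints, which matches the int(1) above.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one codepoint."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def utf16_code_units(text: str) -> int:
    """Count the 16-bit code units needed to store text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


codepoint = combine_surrogates(0xD801, 0xDC00)
char = chr(codepoint)

codepoint_count = len(char)               # length in codepoints
code_unit_count = utf16_code_units(char)  # length in UTF-16 code units
```

A character from the Basic Multilingual Plane such as "\u0105" needs only one code unit, which is why a single \u escape is enough for it.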
And again you are suggesting an unportable solution to me. Parse error: syntax error, unexpected T_CONSTANT_ENCAPSED_STRING in test5.php on line 2
Tough luck. Unicode is a major change, and there is no major change without breakage. It could be made more compatible by using 'u' for marking unicode strings and no prefix for binary strings, but most of the time you want to handle text, not binary data, so that would be an additional burden for the developer. If you definitely want to keep supporting old versions, I'd suggest you use different files for different versions and conditionally include them. Nightmare to maintain, but that's another thing...
I don't want to maintain different script versions for PHP6 with unicode.semantics=on.
Well, /I/ don't want to see progress hindered by backwards compatibility.
I'll wait for better documentation on unicode.*_encoding options and will see what I can do with them.
Well, no encoding option will make "ą" == "\xC4\x85"... To see how unicode string handling works, you can have a look at Python. It's pretty similar...

Regards,
Stefan