> and why this will not return true if $str is ISO-8859-1?

For pure 7-bit characters (code points <= 127) it would return true,
because ASCII is encoded identically in ISO-8859-1 and UTF-8.  But if
there is even a single byte outside of ASCII, it only returns true if
the byte sequences follow UTF-8 semantics.  So it would normally return
false for ISO-8859-1 text that contains such characters.

For example, character é is the single byte 0xe9 (code point 233) in
ISO-8859-1, but the two bytes 0xc3 0xa9 in UTF-8.  So if it encountered
a byte stream such as 0xe92041 ("é A" in ISO-8859-1), it knows it cannot
be UTF-8: 0xe9 starts a three-byte sequence and must be followed by
continuation bytes in the range 0x80-0xbf, which 0x20 is not.  But if it
saw 0xc3a92041 ("é A" in UTF-8), it knows it is valid UTF-8 (it could be
another character set, but it is valid in UTF-8)...
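
To make the two byte streams above concrete, here is a minimal sketch.
The helper name is_valid_utf8() is hypothetical; it is adapted from the
one-liner quoted further down, with an explicit comparison against ""
so that a string such as "0" (which is falsy in PHP) is not reported as
invalid.  It assumes htmlspecialchars() is called without ENT_IGNORE or
ENT_SUBSTITUTE, in which case it returns "" on an invalid sequence.

```php
<?php
// Hypothetical helper, adapted from the one-liner quoted in this thread.
function is_valid_utf8(string $str): bool
{
    // htmlspecialchars() returns "" when the input is not a valid byte
    // sequence in the given encoding (ENT_IGNORE / ENT_SUBSTITUTE unset).
    return $str === "" || htmlspecialchars($str, ENT_NOQUOTES, "UTF-8") !== "";
}

$latin1 = "\xE9\x20\x41";      // "é A" encoded as ISO-8859-1
$utf8   = "\xC3\xA9\x20\x41";  // "é A" encoded as UTF-8

var_dump(is_valid_utf8($latin1)); // bool(false): 0x20 is not a continuation byte
var_dump(is_valid_utf8($utf8));   // bool(true)
var_dump(is_valid_utf8("A"));     // bool(true): plain ASCII is valid UTF-8
```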

Please note that it's not checking if the string **is** UTF-8, just if
the byte sequences in the string are valid when interpreted as UTF-8.
You could have the Latin-1 string 0xc3a92041 ("Ã© A" when read as
Latin-1), which still parses as valid UTF-8...
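
A small sketch of that ambiguity, assuming the iconv extension is
available (it usually is by default): the same four bytes pass a UTF-8
validity check, yet they are also a perfectly legal Latin-1 string.
The validity check here uses preg_match() with the /u modifier instead
of htmlspecialchars(); that is a swapped-in technique, not the one from
this thread.

```php
<?php
$bytes = "\xC3\xA9\x20\x41";

// preg_match() with /u returns 1 for valid UTF-8 and false on an
// invalid sequence, so this only tests validity, not "is UTF-8".
$isValidUtf8 = preg_match('//u', $bytes) === 1;

// Treat the very same bytes as ISO-8859-1 and transcode them to UTF-8:
// 0xC3 becomes "Ã" (U+00C3) and 0xA9 becomes "©" (U+00A9).
$latin1Reading = iconv('ISO-8859-1', 'UTF-8', $bytes);

var_dump($isValidUtf8);               // bool(true)
echo bin2hex($latin1Reading), "\n";   // c383c2a92041, i.e. "Ã© A"
```

Both readings are legitimate, so a validity check alone cannot tell you
which encoding the sender intended.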

On Wed, Jun 22, 2011 at 9:40 AM, Reindl Harald <h.rei...@thelounge.net> wrote:
>
>
> On 22.06.2011 15:30, Gustavo Lopes wrote:
>> On Wed, 22 Jun 2011 13:21:10 +0100, Reindl Harald <h.rei...@thelounge.net>
>> wrote:
>>
>>> On 22.06.2011 14:14, Gustavo Lopes wrote:
>>>> It's actually 3 lines:
>>>>
>>>> function str_is_utf8($str) {
>>>>     return $str == "" || htmlspecialchars($str, 0, "UTF-8");
>>>> }
>>>
>>>
>>> WTF should this do?
>>> this won't return boolean
>>>
>>
>> The reason it works is that
>> 1) || coerces the operands into booleans (if they get to be evaluated)
>> 2) htmlspecialchars returns "" on bad input sequence
>> 3) (bool) "" === false
>>
>> But even if you didn't know these things, you should have bothered to at 
>> least test it
>> before sending this response
>
> and why this will not return true if $str is ISO-8859-1?
>
>

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
