Peter Brodersen wrote:
> On Fri, 25 Jan 2008 14:22:52 -0800, in php.internals [EMAIL PROTECTED]
> (Stanislav Malyshev) wrote:
> 
>>> Should these functions really discard the whole string for a single
>>> incomplete sequence?
>> I think since it is not possible to recover true content of the string, 
>> it is ok to return failure value. Cutting it in random places or 
>> ignoring problems doesn't seem a good idea - it might lead to all kinds 
>> of nasty things, such as security filtering checking one data and 
>> database getting entirely different data.
> 
> On the other hand utf8_decode() also expects the input to be UTF-8
> encoded, but it replaces incomplete sequences with the character "?".
> 
> I don't know if it is a recommended standard for invalid input but I
> have seen this conversion as well in a couple of other applications,
> e.g. Firefox.

utf8_decode() doesn't replace invalid chars with a "?".

For example:

php -r '$a="abcd".chr(0xE0);echo iconv("utf-8","utf-8",$a)."\n".utf8_decode($a);' | od -t x1

0000000    61  62  63  64  0a  61  62  63  64  03

So iconv(), when told to take utf-8 as input and spit out utf-8 as
output, strips out the invalid utf-8 byte, whereas utf8_decode() does who
knows what.  The trailing 0xE0 comes out as 0x03?

It would be a horrendously bad idea to replace invalid chars with some
other valid char.  Way worse than returning nothing.  Think about what
would happen in a regex, for example, if a user was able to inject a '?'
by sending an invalid utf-8 sequence that ends up in a regular expression.
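
A minimal sketch of that regex concern, assuming a hypothetical decoder
that substitutes '?' for the bad byte.  The substituted string is
hard-coded here; it is not the output of any real function, and the
variable names are only for illustration:

```php
<?php
// Hypothetical: what a decoder would hand back for "a" . chr(0xE0)
// if it replaced the invalid byte with '?'. Hard-coded for illustration.
$decoded = "a?";

// Interpolated unescaped, the '?' becomes a quantifier, so the "a"
// is now optional and the pattern is /^a?$/.
$unescaped = '/^' . $decoded . '$/';
$matchesEmpty = preg_match($unescaped, '');

// With preg_quote() the '?' stays a literal character: /^a\?$/.
$escaped = '/^' . preg_quote($decoded, '/') . '$/';
$matchesEmptyQuoted = preg_match($escaped, '');

var_dump($matchesEmpty, $matchesEmptyQuoted);
```

The unescaped pattern matches the empty string (preg_match() returns 1),
while the quoted one does not (it returns 0).  Escaping helps in this one
spot, but the user still got a character into the data that they never
actually sent.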

If we are going to do anything here, it would be to strip the invalid
utf-8 bytes, but technically that's not a great solution from a security
perspective.  The results could be quite unexpected.  The most secure
approach is to fail on invalid input.  It's your job to validate input
and feed the function the input it expects.
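
One way to do that validation up front, as a rough sketch rather than a
recommendation of a specific API: require_valid_utf8() below is a
hypothetical helper, not something PHP ships.  It relies on the /u
pattern modifier, which makes PCRE check the subject string as UTF-8.

```php
<?php
// Hypothetical helper (not part of PHP): reject anything that is not
// valid UTF-8 instead of stripping or replacing bytes.
function require_valid_utf8(string $input): string
{
    // With the /u modifier PCRE validates the subject as UTF-8;
    // preg_match() returns false (not 0) when that validation fails.
    if (preg_match('//u', $input) !== 1) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $input;
}

try {
    // Same input as the example above: "abcd" plus a lone 0xE0 byte.
    require_valid_utf8("abcd" . chr(0xE0));
    echo "accepted\n";
} catch (InvalidArgumentException $e) {
    echo "rejected: ", $e->getMessage(), "\n";
}
```

The truncated sequence is rejected outright, so nothing downstream ever
sees a stripped or substituted version of the string.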

-Rasmus

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
