Peter Brodersen wrote:
> On Fri, 25 Jan 2008 14:22:52 -0800, in php.internals [EMAIL PROTECTED]
> (Stanislav Malyshev) wrote:
>
>>> Should really theses functions discard the whole string for a single
>>> incomplete sequence ?
>> I think since it is not possible to recover true content of the string,
>> it is ok to return failure value. Cutting it in random places or
>> ignoring problems doesn't seem a good idea - it might lead to all kinds
>> of nasty things, such as security filtering checking one data and
>> database getting entirely different data.
>
> On the other hand utf8_decode() also expects the input to be UTF-8
> encoded, but it replaces incomplete sequences with the character "?".
>
> I don't know if it is a recommended standard for invalid input but I
> have seen this conversion as well in a couple of other applications,
> e.g. Firefox.

utf8_decode() doesn't replace invalid chars with a "?", e.g.:

    php -r '$a="abcd".chr(0xE0);echo iconv("utf-8","utf-8",$a)."\n".utf8_decode($a);' | od -t x1
    0000000 61 62 63 64 0a 61 62 63 64 03

So iconv(), when told to take utf-8 as input and spit out utf-8 as
output, strips out invalid utf-8 chars, whereas utf8_decode() does who
knows what. 0xE0 gets converted to 0x03?

It would be a horrendously bad idea to replace invalid chars with some
other valid char. Way worse than returning nothing. Think about what
would happen in a regex, for example, if a user was able to inject a
'?' by sending an invalid utf-8 sequence that ends up in a regular
expression.

If we are going to do anything here, it would be to strip the invalid
utf-8 bytes, but technically that's not a great solution from a security
perspective either. The results could be quite unexpected. The most
secure approach is to fail on invalid input. It's your job to validate
input and feed the function the input it expects.

-Rasmus
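
For illustration only, here is a small Python sketch of the three policies discussed in this thread (fail, replace, strip), applied to the same bytes as the one-liner above. It shows Python's decoder, not the behavior of any PHP function, and the helper names (decode_strict, decode_replace, decode_strip, naive_filter_passes) are made up for this example.

```python
# Same bytes as the PHP one-liner: "abcd" followed by a lone 0xE0 lead byte.
RAW = b"abcd\xe0"


def decode_strict(data: bytes):
    """Fail on invalid input: return the text, or None if data is not valid UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


def decode_replace(data: bytes) -> str:
    """Replace each invalid sequence with a substitute character (U+FFFD here)."""
    return data.decode("utf-8", errors="replace")


def decode_strip(data: bytes) -> str:
    """Silently drop invalid bytes."""
    return data.decode("utf-8", errors="ignore")


def naive_filter_passes(data: bytes) -> bool:
    """Hypothetical byte-level security filter that only rejects a literal <script>."""
    return b"<script>" not in data


if __name__ == "__main__":
    print(repr(decode_strict(RAW)))    # strict policy reports failure
    print(repr(decode_replace(RAW)))   # a new character appears in the output
    print(repr(decode_strip(RAW)))     # the bad byte vanishes

    # The filter sees one thing, the consumer of the stripped string another.
    payload = b"<scr\xe0ipt>"
    print(naive_filter_passes(payload), repr(decode_strip(payload)))
```

The last two lines are the "security filtering checking one data and database getting entirely different data" case: the raw bytes pass the filter, but stripping the invalid byte afterwards yields a string the filter was meant to reject. Only the strict variant makes the caller deal with the bad input explicitly.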