On Mon, 28 Jan 2008 17:26:48 -0800, in php.internals, Rasmus Lerdorf wrote:
>It would be a horrendously bad idea to replace invalid chars with some
>other valid char. Way worse than returning nothing. Think about what
>would happen in a regex, for example, if a user was able to inject a '?'
>by sending an invalid utf-8 sequence that ends up in a regular expression.

By the way, Unicode characters that don't exist in ISO-8859-1 are also replaced with a question mark:

$ php -r 'print utf8_decode(pack("c*",0xe2,0x98,0x83));'|od -t x1
0000000 3f

http://php.net/xml also documents this replacement:
==
If PHP encounters characters in the parsed XML document that can not be represented in the chosen target encoding, the problem characters will be "demoted". Currently, this means that such characters are replaced by a question mark.
==

http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt mentions:
==
According to ISO 10646-1:2000, sections D.7 and 2.3c, a device receiving UTF-8 shall interpret a "malformed sequence in the same way that it interprets a character that is outside the adopted subset" and "characters that are not within the adopted subset shall be indicated to the user" by a receiving device. A quite commonly used approach in UTF-8 decoders is to replace any malformed UTF-8 sequence by a replacement character (U+FFFD), which looks a bit like an inverted question mark, or a similar symbol. It might be a good idea to visually distinguish a malformed UTF-8 sequence from a correctly encoded Unicode character that is just not available in the current font but otherwise fully legal, even though ISO 10646-1 doesn't mandate this. In any case, just ignoring malformed sequences or unavailable characters does not conform to ISO 10646, will make debugging more difficult, and can lead to user confusion.
==

--
- Peter Brodersen
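
For anyone who wants to compare the behaviours discussed above side by side, here is a rough sketch. It is only an illustration, not a proposal for how utf8_decode() should work: it assumes the mbstring extension is loaded, and the sample byte strings are made up for the example. It first reproduces the question-mark substitution for a valid but unmappable character, then marks a malformed sequence with U+FFFD instead, and finally shows the "reject the input" alternative.

```php
<?php
// Sketch only: assumes the mbstring extension is available.

// U+2603 SNOWMAN: valid UTF-8, but not representable in ISO-8859-1.
$snowman = "\xE2\x98\x83";

// Use '?' (0x3F) as the substitute character, which mirrors the
// question-mark replacement shown with utf8_decode() above.
mb_substitute_character(0x3F);
$latin1 = mb_convert_encoding($snowman, 'ISO-8859-1', 'UTF-8');
var_dump(bin2hex($latin1));

// A malformed sequence: the byte 0xFF never occurs in UTF-8.
$malformed = "a\xFFb";

// Mark the bad byte with U+FFFD REPLACEMENT CHARACTER rather than '?',
// so it cannot be mistaken for an ordinary ASCII character.
mb_substitute_character(0xFFFD);
$flagged = mb_convert_encoding($malformed, 'UTF-8', 'UTF-8');
var_dump(bin2hex($flagged));

// Alternative: detect the malformed input and refuse to process it.
var_dump(mb_check_encoding($malformed, 'UTF-8'));
```

Since U+FFFD is not an ASCII character, it should not carry the special meaning in a regular expression that an injected '?' would, which may address part of the concern in the quoted text. Whether substitution or rejection is the better default is still a judgment call.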