On Mon, 28 Jan 2008 17:26:48 -0800, in php.internals [EMAIL PROTECTED]
(Rasmus Lerdorf) wrote:

>It would be a horrendously bad idea to replace invalid chars with some
>other valid char.  Way worse than returning nothing.  Think about what
>would happen in a regex, for example, if a user was able to inject a '?'
>by sending an invalid utf-8 sequence that ends up in a regular expression.

By the way, Unicode characters that don't exist in ISO-8859-1 are also
replaced with a question mark:

$ php -r 'print utf8_decode(pack("c*",0xe2,0x98,0x83));'|od -t x1
0000000 3f
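
The same check can be written as a small script that contrasts a character ISO-8859-1 can represent with one it cannot. This is only a sketch: latin1_hex() is a hypothetical helper name, and it assumes a PHP version that still ships utf8_decode() (the function is deprecated as of PHP 8.2).

```php
<?php
// utf8_decode() is deprecated as of PHP 8.2; silence that notice for this demo.
error_reporting(E_ALL & ~E_DEPRECATED);

// Hypothetical helper: hex dump of the ISO-8859-1 bytes utf8_decode() produces.
function latin1_hex(string $utf8): string
{
    return bin2hex(utf8_decode($utf8));
}

echo latin1_hex("\u{00E9}"), "\n"; // U+00E9 exists in ISO-8859-1: prints e9
echo latin1_hex("\u{2603}"), "\n"; // U+2603 (snowman) does not: prints 3f, i.e. '?'
```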

http://php.net/xml also documents this replacement:
==
If PHP encounters characters in the parsed XML document that can not be
represented in the chosen target encoding, the problem characters will be
"demoted". Currently, this means that such characters are replaced by a
question mark.
==

http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt mentions:
==
According to ISO 10646-1:2000, sections D.7 and 2.3c, a device
receiving UTF-8 shall interpret a "malformed sequence in the same way
that it interprets a character that is outside the adopted subset" and
"characters that are not within the adopted subset shall be indicated
to the user" by a receiving device. A quite commonly used approach in
UTF-8 decoders is to replace any malformed UTF-8 sequence by a
replacement character (U+FFFD), which looks a bit like an inverted
question mark, or a similar symbol. It might be a good idea to
visually distinguish a malformed UTF-8 sequence from a correctly
encoded Unicode character that is just not available in the current
font but otherwise fully legal, even though ISO 10646-1 doesn't
mandate this. In any case, just ignoring malformed sequences or
unavailable characters does not conform to ISO 10646, will make
debugging more difficult, and can lead to user confusion.
==
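
The U+FFFD approach described in that quote can be sketched in PHP roughly as follows. This is an illustration under stated assumptions, not what PHP itself does: replaceMalformedUtf8() is a hypothetical helper, and emitting one U+FFFD per stray byte is only one of several possible replacement policies.

```php
<?php
// Hypothetical helper: keep well-formed UTF-8 sequences, replace every
// other byte with U+FFFD (REPLACEMENT CHARACTER).
function replaceMalformedUtf8(string $bytes): string
{
    // Byte ranges of one well-formed UTF-8 sequence (no overlongs,
    // no surrogates, nothing above U+10FFFF).
    $wellFormed = '[\x00-\x7F]'
        . '|[\xC2-\xDF][\x80-\xBF]'
        . '|\xE0[\xA0-\xBF][\x80-\xBF]'
        . '|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}'
        . '|\xED[\x80-\x9F][\x80-\xBF]'
        . '|\xF0[\x90-\xBF][\x80-\xBF]{2}'
        . '|[\xF1-\xF3][\x80-\xBF]{3}'
        . '|\xF4[\x80-\x8F][\x80-\xBF]{2}';

    // Byte mode (no /u): group 1 is a well-formed sequence, otherwise
    // the trailing "." consumes exactly one stray byte.
    return preg_replace_callback(
        '/(' . $wellFormed . ')|./s',
        fn(array $m): string => ($m[1] ?? '') !== '' ? $m[1] : "\u{FFFD}",
        $bytes
    );
}

// A truncated snowman (0xE2 0x98 without the final 0x83) between 'a' and 'b'.
echo bin2hex(replaceMalformedUtf8("a\xE2\x98b")), "\n";
```

Unlike a '?' substitution, the replacement here is a character that cannot be confused with regex or query-string syntax, and the malformed input stays visible instead of being silently dropped.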


-- 
- Peter Brodersen

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
