In PHP 6, incoming user data will automatically be in Unicode form (that is, assuming the JIT conversion functionality gets implemented).
One implementation detail I'd like to raise involves codepoints that are not permitted in XML and/or SGML appearing inside markup. Per the Unicode specification, it is perfectly valid for a Unicode string to contain the codepoints U+0000 (the null character), U+FFFF (a noncharacter) and friends. However, it is not valid for an XML document to contain these characters; either one will result in a fatal error.

Historically, it has been very difficult for PHP scripts to implement UTF-8 support completely correctly. Many implementations check that the UTF-8 is well-formed, but neglect to strip out null bytes and the like. I would consider validating or filtering against the XML Char production (or perhaps something even more restrictive, as that production allows some control characters that are not allowed in HTML).

How should we go about making this easy in PHP 6? Perhaps a web_encoding (terrible name, I know) function is in order?

--
Edward Z. Yang                        GnuPG: 0x869C48DA
HTML Purifier <http://htmlpurifier.org> Anti-XSS Filter
[[ 3FA8 E9A9 7385 B691 A6FC B3CB A933 BE7D 869C 48DA ]]

--
PHP Internals - PHP Runtime Development Mailing List
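
To make the filtering idea in the message concrete, here is a minimal userland sketch of stripping everything outside the XML 1.0 Char production from a UTF-8 string. It is only an illustration: the function name xml_safe_utf8 is hypothetical (it is not the proposed web_encoding function and not a PHP built-in), and it is written against present-day PHP (7.1 or later) with byte strings rather than the PHP 6 Unicode strings discussed above. It relies on preg_replace() with the /u modifier, which returns null when the subject is not well-formed UTF-8.

```php
<?php
/**
 * Hypothetical helper (not a PHP built-in): removes every codepoint that
 * falls outside the XML 1.0 Char production from a UTF-8 string.
 *
 * Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
 *
 * Returns null when the input is not well-formed UTF-8.
 */
function xml_safe_utf8(string $input): ?string
{
    // With the /u modifier PCRE validates the subject as UTF-8 first;
    // on malformed input preg_replace() returns null instead of a string.
    return preg_replace(
        '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
        '',
        $input
    );
}

// Usage: the null character and the U+FFFF noncharacter are dropped,
// while ordinary text, tabs and newlines pass through unchanged.
$clean = xml_safe_utf8("user\x00 input\u{FFFF}\n");
```

Note that this mirrors the XML production exactly, so it still lets through the control characters in the U+007F to U+009F range; a stricter filter for HTML output would have to remove those as well.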