[PHP-DEV] Unicode and XML

Edward Z. Yang Wed, 28 May 2008 21:24:04 -0700

In PHP 6, incoming user data will automatically be in Unicode form
(that is, assuming that the JIT functionality for converting it gets
implemented).


One of the implementation details I'd like to consider involves non-XML
and/or non-SGML codepoints inside markup. As per the Unicode
specification, it is perfectly valid for a Unicode string to contain the
codepoints U+0000 (null character), U+FFFF (non-character) and friends.
However, it is not valid for an XML document to contain these
characters; either one makes the document not well-formed, so a
conforming parser must stop with a fatal error.

Historically, it has been very difficult for PHP scripts to implement
UTF-8 support completely correctly. Many implementations check that the
UTF-8 is well-formed, but neglect to strip out null characters and the
like. I consider validation/filtering against the XML Char production to
be part of correct support (or perhaps something even more restrictive,
as that production allows some control characters not allowed in HTML).
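
To make the idea concrete, here is a rough sketch of what such a check
could look like in userland today with PCRE. The function and constant
names are made up for illustration (they are not an existing or proposed
PHP API), and it assumes byte strings holding UTF-8 rather than PHP 6
Unicode strings:

```php
<?php
// Hypothetical helpers; the names are illustrative only.

// Matches any code point outside the XML 1.0 Char production:
//   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
const NON_XML_CHAR =
    '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';

// True only if $s is well-formed UTF-8 AND every code point is an XML Char.
function xml_chars_valid(string $s): bool
{
    // With the /u modifier, preg_match() returns false on malformed UTF-8
    // and 0 when no offending code point is found.
    return preg_match(NON_XML_CHAR, $s) === 0;
}

// Removes code points that are not XML Chars.
// Returns null if $s is not well-formed UTF-8.
function xml_chars_strip(string $s): ?string
{
    return preg_replace(NON_XML_CHAR, '', $s);
}
```

With this, xml_chars_valid("a\x00b") is false and
xml_chars_strip("a\x00b\xEF\xBF\xBFc") gives "abc". A stricter HTML
variant would only need a narrower character class.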

How should we go about making this easy in PHP 6? Perhaps a web_encoding
(terrible name, I know) function is in order?
-- 
 Edward Z. Yang                        GnuPG: 0x869C48DA
 HTML Purifier <http://htmlpurifier.org> Anti-XSS Filter
 [[ 3FA8 E9A9 7385 B691 A6FC B3CB A933 BE7D 869C 48DA ]]

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
