On 05/09/2011 11:25 PM, Noah Misch wrote:
I see you've gone with doing it unconditionally. I'd lean toward testing the
library in pg_xml_init and setting a flag indicating whether we need the extra
pass. However, a later patch can always optimize that.
I wasn't terribly keen on the idea, but we can look at it again later.
Please review and try to break.
Here are the test cases I tried:
-- caught successfully
SELECT E'\x01'::xml;
SELECT xmlcomment(E'\x01');
SELECT xmlelement(name foo, xmlattributes(E'\x01' AS bar), '');
SELECT xmlelement(name foo, NULL, E'\x01');
SELECT xmlforest(E'\x01' AS foo);
SELECT xmlpi(name foo, E'\x01');
SELECT query_to_xml($$SELECT E'\x01'$$, true, false, '');
-- not caught
SELECT xmlroot('<root/>', version E'\x01');
That's an easy fix.
SELECT xmlcomment(E'\ufffe');
That's a bit harder. Do we want to extend these checks to cover
surrogates and end of plane characters, which are the remaining
forbidden chars? It certainly seems likely to be a somewhat slower test
since I think we'd need to process the input strings a Unicode char at a
time, but we do that in other places and it seems acceptable. What do
people think?
-- not directly related, but also wrongly accepted
SELECT xmlroot('<root/>', version ' ');
SELECT xmlroot('<root/>', version 'foo');
Offhand, I don't find libxml2's handling of XML declarations particularly
consistent. My copy's xmlCtxtReadDoc() API (used by xml_in when xmloption =
document) accepts '<?xml version="foo"?>' but rejects'<?xml version=" "?>'.
Its xmlParseBalancedChunkMemory() API (used by xml_in when xmloption = content)
accepts anything, even control characters. The XML 1.0 standard is stricter:
the version must match ^1\.[0-9]+$. We might want to tighten this at the same
time.
We can add some stuff to check the version strings. Doesn't seem
terribly difficult.
libxml2's error message for this case is "PCDATA invalid Char value 1"
(assuming \x01). Mentioning PCDATA seems redundant, since no other context
offers greater freedom. How about:
ereport(ERROR,
(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
errmsg("invalid XML 1.0 Char \\U%08x", char_val)));
That would also mean processing the string a unicode char at a time. So
maybe that's what we need to do.
Thanks for the input.
cheers
andrew
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers