Alvaro has committed two of the patches in this CF entry[1], but the remaining two have yet to attract review.
This message contains only those two, just as before[2] except rebased over Alvaro's commits of the others. <confession> There are two new changes. The <ulink>s added in the -docfix patch now have link text, as allowed according to b060e6c, and the -content patch now updates the definition of the 'content' type in datatypes.sgml, which I had overlooked before. </confession> xml-functions-type-docfix-3.patch adjusts the documentation of the XML type and related functions to present some behavior and limitations more clearly. xml-content-2006-2.patch changes the behavior of xmlparse and the text-to-xml cast to allow any XML 'document' (including one with a DTD) to be parsed as 'content', where the former behavior was to fail in that case. This is the same as changing the definition of XML 'content' from that of SQL:2003 to that of SQL:2006 and later. The later definition is preferable, because it eliminates a case that can fail in, e.g., pg_restore (which problem has been reported in the field). The patches apply in that order (because the -docfix one adds language describing the current 'content' behavior, then the -content one changes the behavior, and the language to match it). Regards, -Chap [1] https://commitfest.postgresql.org/22/1872/ [2] https://www.postgresql.org/message-id/3e8eab9e-7289-6c23-5e2c-153cccea2...@anastigmatix.net
diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml index b462c06..94b46a6 100644 --- a/doc/src/sgml/datatype.sgml +++ b/doc/src/sgml/datatype.sgml @@ -4282,16 +4282,21 @@ SET XML OPTION { DOCUMENT | CONTENT }; SET xmloption TO { DOCUMENT | CONTENT }; </synopsis> The default is <literal>CONTENT</literal>, so all forms of XML - data are allowed. + data are allowed except as noted below. </para> <note> <para> - With the default XML option setting, you cannot directly cast - character strings to type <type>xml</type> if they contain a - document type declaration, because the definition of XML content - fragment does not accept them. If you need to do that, either - use <literal>XMLPARSE</literal> or change the XML option. + In the SQL:2006 and later standard, the <literal>CONTENT</literal> form + is a proper superset of the <literal>DOCUMENT</literal> form, and so the + default XML option setting would allow casting to XML from character + strings in either form. <productname>PostgreSQL</productname>, however, + uses the SQL:2003 definition in which <literal>CONTENT</literal> form + cannot contain a document type declaration. Therefore, there is no one + setting of the XML option that will allow casting to XML from every valid + character string. The default will work almost always, but for a document + with a DTD, it will be necessary to change the XML option or + use <literal>XMLPARSE</literal> specifying <literal>DOCUMENT</literal>. </para> </note> diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 03859a7..0017aab 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -10140,8 +10140,13 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple <sect1 id="functions-xml"> + <title>XML Functions</title> + <indexterm> + <primary>XML Functions</primary> + </indexterm> + <para> The functions and function-like expressions described in this section operate on values of type <type>xml</type>. Check <xref @@ -10344,8 +10349,8 @@ SELECT xmlelement(name foo, xmlattributes('xyz' as bar), encoding, depending on the setting of the configuration parameter <xref linkend="guc-xmlbinary"/>. The particular behavior for individual data types is expected to evolve in order to align the - SQL and PostgreSQL data types with the XML Schema specification, - at which point a more precise description will appear. + PostgreSQL mappings with those specified in SQL:2006 and later, + as discussed in <xref linkend="functions-xml-limits-casts"/>. </para> </sect3> @@ -10587,10 +10592,12 @@ SELECT xmlagg(x) FROM (SELECT * FROM test ORDER BY y DESC) AS tab; </synopsis> <para> - The function <function>xmlexists</function> returns true if the - XPath expression in the first argument returns any nodes, and - false otherwise. (If either argument is null, the result is - null.) + The function <function>xmlexists</function> evaluates an XPath 1.0 + expression (the first argument), with the passed value as its context item. + The function returns false if the result of that evaluation yields an + empty nodeset, true if it yields any other value. (The function returns + null if an argument is null.) A nonnull value passed as the context item + must be an XML document, not a content fragment or any non-XML value. </para> <para> @@ -10607,24 +10614,12 @@ SELECT xmlexists('//town[text() = ''Toronto'']' PASSING BY VALUE '<towns><town>T <para> The <literal>BY REF</literal> or <literal>BY VALUE</literal> clauses - have no effect in <productname>PostgreSQL</productname>, but are allowed - for compatibility with other implementations. Per the <acronym>SQL</acronym> - standard, the one that precedes any argument is required, and indicates - the default for arguments that follow, and one may follow any argument to - override the default. - <productname>PostgreSQL</productname> ignores <literal>BY REF</literal> - and passes by value always. - </para> - - <para> - In the <acronym>SQL</acronym> standard, an <function>xmlexists</function> - construct evaluates an expression in the XQuery language, allows passing - values for named parameters in the expression as well as for the context - item, and does not require the passed values to be documents, or even of - XML type. - In <productname>PostgreSQL</productname>, this construct currently only - evaluates an XPath 1.0 expression, and allows passing only one value, - which must be an XML document, to be the context item. + are accepted in <productname>PostgreSQL</productname>, but ignored, as + discussed in <xref linkend="functions-xml-limits-postgresql"/>. + In the SQL standard, an <function>xmlexists</function> construct first + appears in SQL:2006 and evaluates an expression in the XML Query language, + but this implementation allows only an XPath 1.0 expression, as discussed + in <xref linkend="functions-xml-limits-xpath1"/>. </para> </sect3> @@ -10730,12 +10725,12 @@ SELECT xml_is_well_formed_document('<pg:foo xmlns:pg="http://postgresql.org/stuf </synopsis> <para> - The function <function>xpath</function> evaluates the XPath + The function <function>xpath</function> evaluates the XPath 1.0 expression <replaceable>xpath</replaceable> (a <type>text</type> value) against the XML value <replaceable>xml</replaceable>. It returns an array of XML values - corresponding to the node set produced by the XPath expression. - If the XPath expression returns a scalar value rather than a node set, + corresponding to the nodeset produced by the XPath expression. + If the XPath expression returns a scalar value rather than a nodeset, a single-element array is returned. </para> @@ -10797,9 +10792,10 @@ SELECT xpath('//mydefns:b/text()', '<a xmlns="http://example.com"><b>test</b></a <para> The function <function>xpath_exists</function> is a specialized form of the <function>xpath</function> function. Instead of returning the - individual XML values that satisfy the XPath, this function returns a - Boolean indicating whether the query was satisfied or not. This - function is equivalent to the standard <literal>XMLEXISTS</literal> predicate, + individual XML values that satisfy the XPath 1.0 expression, this function + returns a Boolean indicating whether the query was satisfied or not + (specifically, whether it produced any value other than an empty nodeset). + This function is equivalent to the <literal>XMLEXISTS</literal> predicate, except that it also offers support for a namespace mapping argument. </para> @@ -10840,8 +10836,8 @@ SELECT xpath_exists('/my:a/text()', '<my:a xmlns:my="http://example.com">test</m <para> The <function>xmltable</function> function produces a table based - on the given XML value, an XPath filter to extract rows, and an - optional set of column definitions. + on the given XML value, an XPath filter to extract rows, and a + set of column definitions. </para> <para> @@ -10852,30 +10848,33 @@ SELECT xpath_exists('/my:a/text()', '<my:a xmlns:my="http://example.com">test</m </para> <para> - The required <replaceable>row_expression</replaceable> argument is an XPath - expression that is evaluated against the supplied XML document to - obtain an ordered sequence of XML nodes. This sequence is what - <function>xmltable</function> transforms into output rows. + The required <replaceable>row_expression</replaceable> argument is + an XPath 1.0 expression that is evaluated, passing the + <replaceable>document_expression</replaceable> as its context item, to + obtain a set of XML nodes. These nodes are what + <function>xmltable</function> transforms into output rows. (No rows + will be produced if the <replaceable>document_expression</replaceable> + is null, or the <replaceable>row_expression</replaceable> produces an + empty nodeset, or any value other than a nodeset.) </para> <para> - <replaceable>document_expression</replaceable> provides the XML document to - operate on. - The argument must be a well-formed XML document; fragments/forests - are not accepted. + <replaceable>document_expression</replaceable> provides the context item + for the <replaceable>row_expression</replaceable>. It must be a well-formed + XML document; fragments/forests are not accepted. The <literal>BY REF</literal> or <literal>BY VALUE</literal> clauses are - accepted, as described for the <function>xmlexists</function> predicate, - but ignored; PostgreSQL currently passes XML by value always. + accepted but ignored, as discussed in + <xref linkend="functions-xml-limits-postgresql"/>. + In the SQL standard, an <function>xmltable</function> construct first + appears in SQL:2006 and evaluates expressions in the XML Query language, + but this implementation allows only XPath 1.0 expressions, as discussed + in <xref linkend="functions-xml-limits-xpath1"/>. </para> <para> The mandatory <literal>COLUMNS</literal> clause specifies the list of columns in the output table. - If the <literal>COLUMNS</literal> clause is omitted, the rows in the result - set contain a single column of type <literal>xml</literal> containing the - data matched by <replaceable>row_expression</replaceable>. - If <literal>COLUMNS</literal> is specified, each entry describes a - single column. + Each entry describes a single column. See the syntax summary above for the format. The column name and type are required; the path, default and nullability clauses are optional. @@ -10889,42 +10888,78 @@ SELECT xpath_exists('/my:a/text()', '<my:a xmlns:my="http://example.com">test</m </para> <para> - The <literal>column_expression</literal> for a column is an XPath expression - that is evaluated for each row, relative to the result of the - <replaceable>row_expression</replaceable>, to find the value of the column. + The <literal>column_expression</literal> for a column is an XPath 1.0 + expression + that is evaluated for each row, with the current node from the + <replaceable>row_expression</replaceable> result as its context item, + to find the value of the column. If no <literal>column_expression</literal> is given, then the column name is used as an implicit path. </para> <para> - If a column's XPath expression returns multiple elements, an error - is raised. - If the expression matches an empty tag, the result is an - empty string (not <literal>NULL</literal>). - Any <literal>xsi:nil</literal> attributes are ignored. + If a column's XPath expression returns a non-XML value (limited to + string, boolean, or double in XPath 1.0) and the column has a + PostgreSQL type other than <type>xml</type>, the column will be set + as if by assigning the value's string representation to the PostgreSQL + type (adjusting the "string representation" of a boolean to + <literal>1</literal> or <literal>0</literal> if the target column type + category is numeric, otherwise <literal>true</literal> or + <literal>false</literal>). + </para> + + <para> + If the column's expression returns a non-empty set of XML nodes + and the target column's type is <type>xml</type>, the column will + be assigned the expression result exactly, if it is of document or + content form. + <footnote> + <para> + A result containing more than one element node at the top level, or + non-whitespace text outside of an element, is an example of content form. + An XPath result can be of neither form, for example if it returns an + attribute node selected from the element that contains it. Such a result + will be put into content form with each such disallowed node replaced by + its string value, as defined for the XPath 1.0 + <function>string</function> function. + </para> + </footnote> + </para> + + <para> + A non-XML result assigned to an <type>xml</type> output column produces + content, a single text node with the string value of the result. + An XML result assigned to a column of any other type may not have more than + one node, or an error is raised. If there is exactly one node, the column + will be set as if by assigning the node's string + value (as defined for the XPath 1.0 <function>string</function> function) + to the PostgreSQL type. </para> <para> - The text body of the XML matched by the <replaceable>column_expression</replaceable> - is used as the column value. Multiple <literal>text()</literal> nodes - within an element are concatenated in order. Any child elements, - processing instructions, and comments are ignored, but the text contents - of child elements are concatenated to the result. + The string value of an XML element is the concatenation, in document order, + of all text nodes contained in that element and its descendants. The string + value of an element with no descendant text nodes is an + empty string (not <literal>NULL</literal>). + Any <literal>xsi:nil</literal> attributes are ignored. Note that the whitespace-only <literal>text()</literal> node between two non-text elements is preserved, and that leading whitespace on a <literal>text()</literal> node is not flattened. + The XPath 1.0 <function>string</function> function may be consulted for the + rules defining the string value of other XML node types and non-XML values. + </para> + + <para> + The conversion rules presented here are not exactly those of the SQL + standard, as discussed in <xref linkend="functions-xml-limits-casts"/>. </para> <para> - If the path expression does not match for a given row but - <replaceable>default_expression</replaceable> is specified, the value resulting - from evaluating that expression is used. - If no <literal>DEFAULT</literal> clause is given for the column, - the field will be set to <literal>NULL</literal>. - It is possible for a <replaceable>default_expression</replaceable> to reference - the value of output columns that appear prior to it in the column list, - so the default of one column may be based on the value of another - column. + If the path expression returns an empty nodeset + (typically, when it does not match) + for a given row, the column will be set to <literal>NULL</literal>, unless + a <replaceable>default_expression</replaceable> is specified; then the + value resulting from evaluating that expression is used. </para> <para> @@ -10936,20 +10971,14 @@ SELECT xpath_exists('/my:a/text()', '<my:a xmlns:my="http://example.com">test</m </para> <para> - Unlike regular PostgreSQL functions, <replaceable>column_expression</replaceable> - and <replaceable>default_expression</replaceable> are not evaluated to a simple - value before calling the function. - <replaceable>column_expression</replaceable> is normally evaluated - exactly once per input row, and <replaceable>default_expression</replaceable> - is evaluated each time a default is needed for a field. - If the expression qualifies as stable or immutable the repeat + A <replaceable>default_expression</replaceable>, rather than being + evaluated immediately when <function>xmltable</function> is called, + is evaluated each time a default is needed for the column. + If the expression qualifies as stable or immutable, the repeat evaluation may be skipped. - Effectively <function>xmltable</function> behaves more like a subquery than a - function call. This means that you can usefully use volatile functions like - <function>nextval</function> in <replaceable>default_expression</replaceable>, and - <replaceable>column_expression</replaceable> may depend on other parts of the - XML document. + <function>nextval</function> in + <replaceable>default_expression</replaceable>. </para> <para> @@ -11297,6 +11326,293 @@ table2-mapping ]]></programlisting> </figure> </sect2> + + <sect2 id="functions-xml-limits-compatibility"> + <title>Limits and Compatibility</title> + + <indexterm> + <primary>XML Functions</primary> + <secondary>limits and compatibility</secondary> + </indexterm> + + <sect3 id="functions-xml-limits-xpath1"> + <title>Queries restricted to XPath 1.0</title> + + <para> + The <productname>PostgreSQL</productname>-specific functions + <function>xpath</function> and <function>xpath_exists</function> query + XML documents using the XPath language, and + <productname>PostgreSQL</productname> also provides XPath-only variants of + the standard functions <function>XMLEXISTS</function> and + <function>XMLTABLE</function>, which are defined in the SQL standard to use + the XQuery language. For all of these functions, + <productname>PostgreSQL</productname> relies on the + <productname>libxml</productname> library, which provides only XPath 1.0. + </para> + + <para> + There is a strong connection between the XQuery language and XPath versions + 2.0 and later: any expression that is syntactically valid and executes + successfully in both produces the same result (with a minor exception for + expressions containing numeric character references or predefined entity + references, which XQuery replaces with the corresponding character while + XPath leaves them alone). But there is no such connection between XPath 1.0 + and XQuery or the later XPath versions; it was an earlier language and + differs in many respects. + </para> + + <para> + There are two categories of limitation to keep in mind: the restriction + from XQuery to XPath for the functions specified in the SQL standard, and + the restriction of XPath to version 1.0 for both the standard and the + <productname>PostgreSQL</productname>-specific functions. + </para> + + <sect4> + <title>Restriction of XQuery to XPath</title> + + <para> + Features of XQuery beyond those of XPath include: + + <itemizedlist> + <listitem> + <para> + XQuery expressions can construct and return new XML nodes, in addition + to all possible XPath values. XPath can introduce and return values of + the atomic types (numbers, strings, and so on) but can only return XML + nodes already present in documents supplied as input to the expression. + </para> + </listitem> + + <listitem> + <para> + XQuery has control constructs for iteration, sorting, and grouping. + </para> + </listitem> + + <listitem> + <para> + XQuery allows the declaration and use of local functions. + </para> + </listitem> + </itemizedlist> + </para> + + <para> + Recent XPath versions begin to offer overlapping capabilities + (such as the functional-style <function>for-each</function> and + <function>sort</function>, anonymous functions, and + <function>parse-xml</function> to create a node from a string), + but these were not available before XPath 3.0. + </para> + </sect4> + + <sect4> + <title>Restriction of XPath to 1.0</title> + + <para> + For developers familiar with XQuery and XPath 2.0 or later, or porting + queries from other systems, XPath 1.0 presents a number of differences to + contend with: + + <itemizedlist> + <listitem> + <para> + The fundamental type of an XQuery/XPath expression, the + <type>sequence</type>, which can contain XML nodes, atomic values, + or both, does not exist in XPath 1.0. A 1.0 expression can only produce + a nodeset (possibly empty, or with one XML node or more), or a single + atomic value. + </para> + </listitem> + + <listitem> + <para> + Unlike an XQuery/XPath sequence, which can contain any desired items + in any desired order, an XPath 1.0 nodeset has no guaranteed order and, + like any set, can have no member appear more than once. (The + <productname>libxml</productname> library does seem to always return + nodesets to <productname>PostgreSQL</productname> with their members + in the same relative order they had in the input document; it does not + commit to this behavior, and an XPath 1.0 expression cannot control + it.) + </para> + </listitem> + + <listitem> + <para> + While XQuery/XPath provides all of the types defined in XML Schema + and many operators and functions over those types, XPath 1.0 has only + nodesets and three atomic types, <type>boolean</type>, + <type>double</type>, and <type>string</type>. + </para> + </listitem> + + <listitem> + <para> + XPath 1.0 has no conditional operator. An XQuery/XPath expression + such as <userinput>if ( hat ) then hat/@size else "no hat"</userinput> + has no XPath 1.0 equivalent. + </para> + </listitem> + + <listitem> + <para> + XPath 1.0 has no ordering comparison operator for strings. Both + <userinput>"cat" < "dog"</userinput> and + <userinput>"cat" > "dog"</userinput> are false, because each is a + numeric comparison of two <literal>NaN</literal>s. In contrast, + <literal>=</literal> and <literal>!=</literal> do compare the strings + as strings. + </para> + </listitem> + + <listitem> + <para> + XPath 1.0 blurs the distinction between + <firstterm>value comparisons</firstterm> and + <firstterm>general comparisons</firstterm> as XQuery/XPath define them. + Both <userinput>sale/@hatsize = 7</userinput> and + <userinput>sale/@customer = "alice"</userinput> are existentially + quantified comparisons, true if there is any sale with the given value + for the attribute, but <userinput>sale/@taxable = false()</userinput> + is a value comparison to the + <firstterm>effective boolean value</firstterm> of a whole nodeset, + and true only if no sale has a <literal>taxable</literal> attribute + at all. + </para> + </listitem> + + <listitem> + <para> + In the XQuery/XPath data model, a <firstterm>document node</firstterm> + can have either document form (exactly one top-level element, only + comments and processing instructions outside of it) or content form + (with those constraints relaxed). Its equivalent in XPath 1.0, the + <firstterm>root node</firstterm>, can only be in document form. + This is part of the reason an <type>xml</type> value passed as the + context item to any <productname>PostgreSQL</productname> XPath-based + function must be in document form. + </para> + </listitem> + </itemizedlist> + </para> + + <para> + The differences highlighted here are not all of them. In XQuery and + the 2.0 and later versions of XPath, there is an XPath 1.0 compatibility + mode, and the W3C lists of function library + <ulink url='https://www.w3.org/TR/2010/REC-xpath-functions-20101214/#xpath1-compatibility'>changes</ulink> + and language + <ulink url='https://www.w3.org/TR/xpath20/#id-backwards-compatibility'>changes</ulink> + applied in that mode offer a more complete account of the + differences, if still not exhaustive; the compatibility mode cannot make + the later languages exactly equivalent to XPath 1.0. + </para> + </sect4> + + <sect4 id="functions-xml-limits-casts"> + <title>Mappings between SQL and XML data types and values</title> + + <para> + In SQL:2006 and later, both directions of conversion between standard SQL + data types and the XML Schema types are specified precisely. However, the + rules are expressed using the types and semantics of XQuery/XPath, and + have no direct application to the different data model of XPath 1.0. + </para> + + <para> + When <productname>PostgreSQL</productname> maps SQL data values to XML + (as in <function>xmlelement</function>), or XML to SQL (as in the output + columns of <function>xmltable</function>), except for the few cases + treated specially, <productname>PostgreSQL</productname> simply assumes + that the XML data type's XPath 1.0 string form will be valid as the + text-input form of the SQL datatype, and conversely. This rule has the + virtue of simplicity while producing, for many data types, results similar + to the mappings specified in the standard. + </para> + + <para> + Where interoperability with other systems is a concern, for some data + types, it may be necessary to use available data type formatting functions + (such as those in <xref linkend="functions-formatting"/>) explicitly in + queries to produce the standard mappings. + </para> + </sect4> + </sect3> + + <sect3 id="functions-xml-limits-postgresql"> + <title> + Incidental limits of the implementation + </title> + + <para> + This section concerns limits that are not inherent in the + <productname>libxml</productname> library, but apply to the current + implementation in <productname>PostgreSQL</productname>. + </para> + + <sect4> + <title>Only <literal>BY VALUE</literal> passing mechanism supported</title> + + <para> + The SQL standard defines two <firstterm>passing mechanisms</firstterm> + that apply when passing an XML argument from SQL to an XML function or + receiving a result: <literal>BY REF</literal>, in which a particular XML + value retains its node identity, and <literal>BY VALUE</literal>, in which + the content of the XML is passed but node identity is not preserved. A + mechanism can be specified before a list of parameters, as the default + mechanism for all of them, or after any parameter, to override the + default. + </para> + + <para> + To illustrate the difference, if + <replaceable>x</replaceable> is an XML value, these two queries in + an SQL:2006 environment would produce true and false, respectively: + + <screen><![CDATA[ +SELECT XMLQUERY('$a is $b' PASSING BY REF x AS a, x AS b NULL ON EMPTY); +SELECT XMLQUERY('$a is $b' PASSING BY VALUE x AS a, x AS b NULL ON EMPTY); +]]></screen> + </para> + + <para> + <productname>PostgreSQL</productname> will accept either + <literal>BY VALUE</literal> or <literal>BY REF</literal> in an + <function>XMLEXISTS</function> or <function>XMLTABLE</function> construct, + but ignores them; the <type>xml</type> data type holds a character-string + serialized representation, so there is no node identity to preserve, + and passing is always <literal>BY VALUE</literal>. + </para> + </sect4> + + <sect4> + <title>Cannot pass named parameters to queries</title> + + <para> + The XPath-based functions support passing one parameter to serve as the + XPath expression's context item, but do not support passing additional + values to be available to the expression as named parameters. + </para> + </sect4> + + <sect4> + <title>No <type>XML(SEQUENCE)</type> type</title> + + <para> + The <productname>PostgreSQL</productname> <type>xml</type> can only hold + a value in <literal>DOCUMENT</literal> or <literal>CONTENT</literal> form. + An XQuery/XPath expression context item must be a single XML node + or atomic value, while XPath 1.0 further restricts it to only an XML node, + and has no node type allowing <literal>CONTENT</literal>. The upshot is + that a well-formed <literal>DOCUMENT</literal> is the only form of XML + value that <productname>PostgreSQL</productname> can supply as an XPath + context item. + </para> + </sect4> + </sect3> + </sect2> </sect1> <sect1 id="functions-json"> diff --git a/src/backend/catalog/sql_features.txt b/src/backend/catalog/sql_features.txt index bade0fe..e947613 100644 --- a/src/backend/catalog/sql_features.txt +++ b/src/backend/catalog/sql_features.txt @@ -569,7 +569,7 @@ X056 Advanced table mapping: data mapping YES X057 Advanced table mapping: metadata mapping YES X058 Advanced table mapping: base64 encoding of binary strings YES X059 Advanced table mapping: hex encoding of binary strings YES -X060 XMLParse: character string input and CONTENT option YES +X060 XMLParse: character string input and CONTENT option YES uses SQL:2003 definition, so a document with a DTD cannot be parsed as CONTENT X061 XMLParse: character string input and DOCUMENT option YES X065 XMLParse: BLOB input and CONTENT option NO X066 XMLParse: BLOB input and DOCUMENT option NO @@ -593,7 +593,7 @@ X085 Predefined namespace prefixes NO X086 XML namespace declarations in XMLTable NO X090 XML document predicate YES X091 XML content predicate NO -X096 XMLExists NO XPath only +X096 XMLExists NO XPath 1.0 only X100 Host language support for XML: CONTENT option NO X101 Host language support for XML: DOCUMENT option NO X110 Host language support for XML: VARCHAR mapping NO @@ -661,11 +661,11 @@ X282 XMLValidate with CONTENT option NO X283 XMLValidate with SEQUENCE option NO X284 XMLValidate: NAMESPACE without ELEMENT clause NO X286 XMLValidate: NO NAMESPACE with ELEMENT clause NO -X300 XMLTable NO XPath only +X300 XMLTable NO XPath 1.0 only X301 XMLTable: derived column list option YES X302 XMLTable: ordinality column option YES X303 XMLTable: column default option YES -X304 XMLTable: passing a context item YES +X304 XMLTable: passing a context item YES must be XML DOCUMENT X305 XMLTable: initializing an XQuery variable NO X400 Name and identifier mapping YES X410 Alter column data type: XML type YES
diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml index 94b46a6..752eadf 100644 --- a/doc/src/sgml/datatype.sgml +++ b/doc/src/sgml/datatype.sgml @@ -4208,9 +4208,11 @@ a0ee-bc99-9c0b-4ef8-bb6d-6bb9-bd38-0a11 <para> The <type>xml</type> type can store well-formed <quote>documents</quote>, as defined by the XML standard, as well - as <quote>content</quote> fragments, which are defined by the - production <literal>XMLDecl? content</literal> in the XML - standard. Roughly, this means that content fragments can have + as <quote>content</quote> fragments, which are defined by reference to the + more permissive + <ulink url="https://www.w3.org/TR/2010/REC-xpath-datamodel-20101214/#DocumentNode"><quote>document node</quote></ulink> + of the XQuery and XPath data model. + Roughly, this means that content fragments can have more than one top-level element or character node. The expression <literal><replaceable>xmlvalue</replaceable> IS DOCUMENT</literal> can be used to evaluate whether a particular <type>xml</type> @@ -4282,24 +4284,9 @@ SET XML OPTION { DOCUMENT | CONTENT }; SET xmloption TO { DOCUMENT | CONTENT }; </synopsis> The default is <literal>CONTENT</literal>, so all forms of XML - data are allowed except as noted below. + data are allowed. </para> - <note> - <para> - In the SQL:2006 and later standard, the <literal>CONTENT</literal> form - is a proper superset of the <literal>DOCUMENT</literal> form, and so the - default XML option setting would allow casting to XML from character - strings in either form. <productname>PostgreSQL</productname>, however, - uses the SQL:2003 definition in which <literal>CONTENT</literal> form - cannot contain a document type declaration. Therefore, there is no one - setting of the XML option that will allow casting to XML from every valid - character string. The default will work almost always, but for a document - with a DTD, it will be necessary to change the XML option or - use <literal>XMLPARSE</literal> specifying <literal>DOCUMENT</literal>. - </para> - </note> - </sect2> <sect2> diff --git a/src/backend/catalog/sql_features.txt b/src/backend/catalog/sql_features.txt index e947613..32908c1 100644 --- a/src/backend/catalog/sql_features.txt +++ b/src/backend/catalog/sql_features.txt @@ -569,7 +569,7 @@ X056 Advanced table mapping: data mapping YES X057 Advanced table mapping: metadata mapping YES X058 Advanced table mapping: base64 encoding of binary strings YES X059 Advanced table mapping: hex encoding of binary strings YES -X060 XMLParse: character string input and CONTENT option YES uses SQL:2003 definition, so a document with a DTD cannot be parsed as CONTENT +X060 XMLParse: character string input and CONTENT option YES X061 XMLParse: character string input and DOCUMENT option YES X065 XMLParse: BLOB input and CONTENT option NO X066 XMLParse: BLOB input and DOCUMENT option NO diff --git a/src/backend/utils/adt/xml.c b/src/backend/utils/adt/xml.c index 28b3eaa..23f7dcb 100644 --- a/src/backend/utils/adt/xml.c +++ b/src/backend/utils/adt/xml.c @@ -110,6 +110,8 @@ struct PgXmlErrorContext /* current error status and accumulated message, if any */ bool err_occurred; StringInfoData err_buf; + /* set in xml_parse if trying to parse CONTENT; reset in err hdlr if DTD */ + bool tentatively_content; /* previous libxml error handling state (saved by pg_xml_init) */ xmlStructuredErrorFunc saved_errfunc; void *saved_errcxt; @@ -120,6 +122,7 @@ struct PgXmlErrorContext static xmlParserInputPtr xmlPgEntityLoader(const char *URL, const char *ID, xmlParserCtxtPtr ctxt); static void xml_errorHandler(void *data, xmlErrorPtr error); +static bool xml_onlyPrologNodes(xmlDocPtr doc); static void xml_ereport_by_code(int level, int sqlcode, const char *msg, int errcode); static void chopStringInfoNewlines(StringInfo str); @@ -1002,6 +1005,7 @@ pg_xml_init(PgXmlStrictness strictness) errcxt->magic = ERRCXT_MAGIC; errcxt->strictness = strictness; errcxt->err_occurred = false; + errcxt->tentatively_content = false; initStringInfo(&errcxt->err_buf); /* @@ -1433,6 +1437,9 @@ xml_parse(text *data, XmlOptionType xmloption_arg, bool preserve_whitespace, xmlChar *string; xmlChar *utf8string; PgXmlErrorContext *xmlerrcxt; + int errcode_document_parse = ERRCODE_INVALID_XML_DOCUMENT; + char const *errmsg_document_parse = "invalid XML document"; + char const *errmsg_content_parse = "invalid XML content"; volatile xmlParserCtxtPtr ctxt = NULL; volatile xmlDocPtr doc = NULL; @@ -1457,6 +1464,7 @@ xml_parse(text *data, XmlOptionType xmloption_arg, bool preserve_whitespace, xml_ereport(xmlerrcxt, ERROR, ERRCODE_OUT_OF_MEMORY, "could not allocate parser context"); +try_other_xmloption: if (xmloption_arg == XMLOPTION_DOCUMENT) { /* @@ -1472,8 +1480,8 @@ xml_parse(text *data, XmlOptionType xmloption_arg, bool preserve_whitespace, XML_PARSE_NOENT | XML_PARSE_DTDATTR | (preserve_whitespace ? 0 : XML_PARSE_NOBLANKS)); if (doc == NULL || xmlerrcxt->err_occurred) - xml_ereport(xmlerrcxt, ERROR, ERRCODE_INVALID_XML_DOCUMENT, - "invalid XML document"); + xml_ereport(xmlerrcxt, ERROR, errcode_document_parse, + errmsg_document_parse); } else { @@ -1497,11 +1505,42 @@ xml_parse(text *data, XmlOptionType xmloption_arg, bool preserve_whitespace, /* allow empty content */ if (*(utf8string + count)) { + xmlerrcxt->tentatively_content = true; res_code = xmlParseBalancedChunkMemory(doc, NULL, NULL, 0, utf8string + count, NULL); + /* + * If tentatively_content is now false, the error handler has + * reset it, which will only happen for one reason: a DTD was + * found in the input. The SQL/XML:2003 definition of CONTENT + * is not satisfied by a document with a DTD, which is a bit of + * a wart, as it means the CONTENT type is not a proper superset + * of DOCUMENT. SQL/XML:2006 and later fix that, by declaring + * that any DOCUMENT value is indeed also a CONTENT value. That + * definition is more useful, as CONTENT becomes usable for + * parsing input of unknown form (think pg_restore). Without + * bringing the whole implementation along to 2006+, we can + * provide the 2006+ definition of CONTENT easily enough, by + * catching this error case and simply retrying the parse + * as DOCUMENT. + */ + if ( ! xmlerrcxt->tentatively_content ) + { + /* + * xmlParseBalanced... was not passed our ctxt; it + * creates one internally for its own use, so we have + * no need to clear/reset ctxt here. + */ + xmlFreeDoc(doc); + xmloption_arg = XMLOPTION_DOCUMENT; + errcode_document_parse = ERRCODE_INVALID_XML_CONTENT; + errmsg_document_parse = errmsg_content_parse; + goto try_other_xmloption; + } if (res_code != 0 || xmlerrcxt->err_occurred) + { xml_ereport(xmlerrcxt, ERROR, ERRCODE_INVALID_XML_CONTENT, - "invalid XML content"); + errmsg_content_parse); + } } } } @@ -1699,6 +1738,27 @@ xml_errorHandler(void *data, xmlErrorPtr error) switch (domain) { case XML_FROM_PARSER: + /* + * One very specific error from the parser will be intercepted here + * if strictness is PG_XML_STRICTNESS_WELLFORMED (meaning the call + * originates in xml_parse), the input is being tentatively parsed + * as content, and the parser has stumbled on what looks like a DTD. + * For that case, just change tentatively_content to false and + * return, and xml_parse will retry the input as a document. + * See xml_parse for motivation. + */ + if (xmlerrcxt->strictness == PG_XML_STRICTNESS_WELLFORMED + && xmlerrcxt->tentatively_content + && error->code == XML_ERR_NAME_REQUIRED + && input != NULL + && input->cur != NULL + && 0 == xmlStrncmp(input->cur, BAD_CAST"!DOCTYPE", 8) + && xml_onlyPrologNodes(ctxt->myDoc)) + { + xmlerrcxt->tentatively_content = false; + return; + } + /* FALLTHROUGH */ case XML_FROM_NONE: case XML_FROM_MEMORY: case XML_FROM_IO: @@ -1811,6 +1871,52 @@ xml_errorHandler(void *data, xmlErrorPtr error) pfree(errorBuf); } +/* + * Check (from the error handler) that the document node's only children already + * parsed are those that can precede a DTD. + * + * To deliver the SQL/XML:2006+ definition of 'content', the error handler above + * can cause a reparse as 'document' if a tentative parse as 'content' trips + * over something that looks like a DTD. Without this check, goofy input like + * <a><!DOCTYPE b></a> could also cause an attempted 'document' reparse, which + * would correctly fail, but the error message could give the wrong reason. + * + * This explores the document that is being constructed. If anything about its + * structure is not as expected, the return is false, so the libxml error will + * be reported, with no other attempt at special handling. + */ +static bool +xml_onlyPrologNodes(xmlDocPtr doc) +{ + xmlNodePtr n; + + if ( doc == NULL ) + return false; + + n = doc->children; /* parseBalancedChunk creates a temporary "pseudoroot" */ + if ( n == NULL || n->type != XML_ELEMENT_NODE ) + return false; + if ( n->name == NULL || 0 != xmlStrcmp(n->name, BAD_CAST"pseudoroot") ) + return false; + + for ( n = n->children; n != NULL; n = n->next ) + { + switch (n->type) + { + case XML_PI_NODE: + case XML_COMMENT_NODE: + continue; + case XML_TEXT_NODE: + if ( xmlIsBlankNode(n) ) + continue; + /* FALLTHROUGH */ + default: + return false; + } + } + return true; +} + /* * Wrapper for "ereport" function for XML-related errors. The "msg" diff --git a/src/test/regress/expected/xml.out b/src/test/regress/expected/xml.out index 2085fa0..0aae600 100644 --- a/src/test/regress/expected/xml.out +++ b/src/test/regress/expected/xml.out @@ -532,6 +532,13 @@ LINE 1: EXECUTE foo ('bad'); DETAIL: line 1: Start tag expected, '<' not found bad ^ +SELECT xml '<!DOCTYPE a><a/><b/>'; +ERROR: invalid XML document +LINE 1: SELECT xml '<!DOCTYPE a><a/><b/>'; + ^ +DETAIL: line 1: Extra content at the end of the document +<!DOCTYPE a><a/><b/> + ^ SET XML OPTION CONTENT; EXECUTE foo ('<bar/>'); xmlconcat @@ -545,6 +552,45 @@ EXECUTE foo ('good'); <foo/>good (1 row) +SELECT xml '<!-- in SQL:2006+ a doc is content too--> <?y z?> <!DOCTYPE a><a/>'; + xml +-------------------------------------------------------------------- + <!-- in SQL:2006+ a doc is content too--> <?y z?> <!DOCTYPE a><a/> +(1 row) + +SELECT xml '<?xml version="1.0"?> <!-- hi--> <!DOCTYPE a><a/>'; + xml +------------------------------ + <!-- hi--> <!DOCTYPE a><a/> +(1 row) + +SELECT xml '<!DOCTYPE a><a/>'; + xml +------------------ + <!DOCTYPE a><a/> +(1 row) + +SELECT xml '<!-- hi--> oops <!DOCTYPE a><a/>'; +ERROR: invalid XML content +LINE 1: SELECT xml '<!-- hi--> oops <!DOCTYPE a><a/>'; + ^ +DETAIL: line 1: StartTag: invalid element name +<!-- hi--> oops <!DOCTYPE a><a/> + ^ +SELECT xml '<!-- hi--> <oops/> <!DOCTYPE a><a/>'; +ERROR: invalid XML content +LINE 1: SELECT xml '<!-- hi--> <oops/> <!DOCTYPE a><a/>'; + ^ +DETAIL: line 1: StartTag: invalid element name +<!-- hi--> <oops/> <!DOCTYPE a><a/> + ^ +SELECT xml '<!DOCTYPE a><a/><b/>'; +ERROR: invalid XML content +LINE 1: SELECT xml '<!DOCTYPE a><a/><b/>'; + ^ +DETAIL: line 1: Extra content at the end of the document +<!DOCTYPE a><a/><b/> + ^ -- Test backwards parsing CREATE VIEW xmlview1 AS SELECT xmlcomment('test'); CREATE VIEW xmlview2 AS SELECT xmlconcat('hello', 'you'); diff --git a/src/test/regress/expected/xml_1.out b/src/test/regress/expected/xml_1.out index 7f86696..d1a03b5 100644 --- a/src/test/regress/expected/xml_1.out +++ b/src/test/regress/expected/xml_1.out @@ -429,11 +429,53 @@ EXECUTE foo ('<bar/>'); ERROR: prepared statement "foo" does not exist EXECUTE foo ('bad'); ERROR: prepared statement "foo" does not exist +SELECT xml '<!DOCTYPE a><a/><b/>'; +ERROR: unsupported XML feature +LINE 1: SELECT xml '<!DOCTYPE a><a/><b/>'; + ^ +DETAIL: This functionality requires the server to be built with libxml support. +HINT: You need to rebuild PostgreSQL using --with-libxml. SET XML OPTION CONTENT; EXECUTE foo ('<bar/>'); ERROR: prepared statement "foo" does not exist EXECUTE foo ('good'); ERROR: prepared statement "foo" does not exist +SELECT xml '<!-- in SQL:2006+ a doc is content too--> <?y z?> <!DOCTYPE a><a/>'; +ERROR: unsupported XML feature +LINE 1: SELECT xml '<!-- in SQL:2006+ a doc is content too--> <?y z?... + ^ +DETAIL: This functionality requires the server to be built with libxml support. +HINT: You need to rebuild PostgreSQL using --with-libxml. +SELECT xml '<?xml version="1.0"?> <!-- hi--> <!DOCTYPE a><a/>'; +ERROR: unsupported XML feature +LINE 1: SELECT xml '<?xml version="1.0"?> <!-- hi--> <!DOCTYPE a><a/... + ^ +DETAIL: This functionality requires the server to be built with libxml support. +HINT: You need to rebuild PostgreSQL using --with-libxml. +SELECT xml '<!DOCTYPE a><a/>'; +ERROR: unsupported XML feature +LINE 1: SELECT xml '<!DOCTYPE a><a/>'; + ^ +DETAIL: This functionality requires the server to be built with libxml support. +HINT: You need to rebuild PostgreSQL using --with-libxml. +SELECT xml '<!-- hi--> oops <!DOCTYPE a><a/>'; +ERROR: unsupported XML feature +LINE 1: SELECT xml '<!-- hi--> oops <!DOCTYPE a><a/>'; + ^ +DETAIL: This functionality requires the server to be built with libxml support. +HINT: You need to rebuild PostgreSQL using --with-libxml. +SELECT xml '<!-- hi--> <oops/> <!DOCTYPE a><a/>'; +ERROR: unsupported XML feature +LINE 1: SELECT xml '<!-- hi--> <oops/> <!DOCTYPE a><a/>'; + ^ +DETAIL: This functionality requires the server to be built with libxml support. +HINT: You need to rebuild PostgreSQL using --with-libxml. +SELECT xml '<!DOCTYPE a><a/><b/>'; +ERROR: unsupported XML feature +LINE 1: SELECT xml '<!DOCTYPE a><a/><b/>'; + ^ +DETAIL: This functionality requires the server to be built with libxml support. +HINT: You need to rebuild PostgreSQL using --with-libxml. -- Test backwards parsing CREATE VIEW xmlview1 AS SELECT xmlcomment('test'); CREATE VIEW xmlview2 AS SELECT xmlconcat('hello', 'you'); diff --git a/src/test/regress/expected/xml_2.out b/src/test/regress/expected/xml_2.out index 510b09b..9756b6e 100644 --- a/src/test/regress/expected/xml_2.out +++ b/src/test/regress/expected/xml_2.out @@ -512,6 +512,13 @@ LINE 1: EXECUTE foo ('bad'); DETAIL: line 1: Start tag expected, '<' not found bad ^ +SELECT xml '<!DOCTYPE a><a/><b/>'; +ERROR: invalid XML document +LINE 1: SELECT xml '<!DOCTYPE a><a/><b/>'; + ^ +DETAIL: line 1: Extra content at the end of the document +<!DOCTYPE a><a/><b/> + ^ SET XML OPTION CONTENT; EXECUTE foo ('<bar/>'); xmlconcat @@ -525,6 +532,45 @@ EXECUTE foo ('good'); <foo/>good (1 row) +SELECT xml '<!-- in SQL:2006+ a doc is content too--> <?y z?> <!DOCTYPE a><a/>'; + xml +-------------------------------------------------------------------- + <!-- in SQL:2006+ a doc is content too--> <?y z?> <!DOCTYPE a><a/> +(1 row) + +SELECT xml '<?xml version="1.0"?> <!-- hi--> <!DOCTYPE a><a/>'; + xml +------------------------------ + <!-- hi--> <!DOCTYPE a><a/> +(1 row) + +SELECT xml '<!DOCTYPE a><a/>'; + xml +------------------ + <!DOCTYPE a><a/> +(1 row) + +SELECT xml '<!-- hi--> oops <!DOCTYPE a><a/>'; +ERROR: invalid XML content +LINE 1: SELECT xml '<!-- hi--> oops <!DOCTYPE a><a/>'; + ^ +DETAIL: line 1: StartTag: invalid element name +<!-- hi--> oops <!DOCTYPE a><a/> + ^ +SELECT xml '<!-- hi--> <oops/> <!DOCTYPE a><a/>'; +ERROR: invalid XML content +LINE 1: SELECT xml '<!-- hi--> <oops/> <!DOCTYPE a><a/>'; + ^ +DETAIL: line 1: StartTag: invalid element name +<!-- hi--> <oops/> <!DOCTYPE a><a/> + ^ +SELECT xml '<!DOCTYPE a><a/><b/>'; +ERROR: invalid XML content +LINE 1: SELECT xml '<!DOCTYPE a><a/><b/>'; + ^ +DETAIL: line 1: Extra content at the end of the document +<!DOCTYPE a><a/><b/> + ^ -- Test backwards parsing CREATE VIEW xmlview1 AS SELECT xmlcomment('test'); CREATE VIEW xmlview2 AS SELECT xmlconcat('hello', 'you'); diff --git a/src/test/regress/sql/xml.sql b/src/test/regress/sql/xml.sql index 8057a46..71431d8 100644 --- a/src/test/regress/sql/xml.sql +++ b/src/test/regress/sql/xml.sql @@ -149,10 +149,17 @@ PREPARE foo (xml) AS SELECT xmlconcat('<foo/>', $1); SET XML OPTION DOCUMENT; EXECUTE foo ('<bar/>'); EXECUTE foo ('bad'); +SELECT xml '<!DOCTYPE a><a/><b/>'; SET XML OPTION CONTENT; EXECUTE foo ('<bar/>'); EXECUTE foo ('good'); +SELECT xml '<!-- in SQL:2006+ a doc is content too--> <?y z?> <!DOCTYPE a><a/>'; +SELECT xml '<?xml version="1.0"?> <!-- hi--> <!DOCTYPE a><a/>'; +SELECT xml '<!DOCTYPE a><a/>'; +SELECT xml '<!-- hi--> oops <!DOCTYPE a><a/>'; +SELECT xml '<!-- hi--> <oops/> <!DOCTYPE a><a/>'; +SELECT xml '<!DOCTYPE a><a/><b/>'; -- Test backwards parsing