Hi Florian, I tested this patch using two systems, a CentOS 5 box with libxml2-2.6.26-2.1.2.8.el5_5.1 and an Ubuntu 8.04 box with libxml2 2.6.31.dfsg-2ubuntu1.6. Both failed to build with this error:
xml.c: In function `pg_xml_init': xml.c:934: error: `xmlStructuredErrorContext' undeclared (first use in this function) xml.c:934: error: (Each undeclared identifier is reported only once xml.c:934: error: for each function it appears in.) make[4]: *** [xml.o] Error 1 It seems `xmlStructuredErrorContext' was added rather recently. It's not obvious which version added it, but 2.7.8 has it and 2.7.2 does not. Looking at 2.6.26's sources, that version seems to use `xmlGenericErrorContext' for both structured and generic handler functions. Based on that, I tried with a change from `xmlStructuredErrorContext' to `xmlGenericErrorContext' in our xml.c. I ran the test suite and hit some failures in the xml test; see attached regression.diffs. I received warnings where the tests expected nothing or expected errors. Looks like my version of libxml2 (2.6.26 in this case) classifies some of these messages differently. I suggest testing the next version with both libxml2 2.6.23, the minimum supported version, and libxml2 2.7.8, the latest version. On Thu, Jun 09, 2011 at 06:11:46PM +0200, Florian Pflug wrote: > On Jun2, 2011, at 01:34 , Florian Pflug wrote: > > On Jun2, 2011, at 00:02 , Noah Misch wrote: > >> On Wed, Jun 01, 2011 at 06:16:21PM +0200, Florian Pflug wrote: > >>> Anyway, I'll try to come up with a patch that replaces > >>> xmlSetGenericErrorFunc() with xmlSetStructuredErrorFunc(). > >> > >> Sounds sensible. Will this impose any new libxml2 version dependency? > > > > xmlSetStructuredErrorFunc() seems to be available starting with libxml > > 2.6.0, > > release on Oct 20, 2003. Since we already require the version to be >= > > 2.6.23, > > we should be OK. > > > > I won't have access to my PC the next few days, but I'll try to come up with > > a patch some time next week. > > Phew... I did manage to produce a patch, but it was way more work than I > had intended to put into this. > > As it turns out, you loose the nicely formatted context information that > libxml2 provides via the generic error func once you switch to structured > error reporting. Registering handlers for both doesn't help either, since > the generic error handler isn't called once you register a structured one. > > Fortunately, libxml does export xmlParserPrintFileContext() which > generates these context messages. It, however, doesn't return a string, > but instead passes them to the generic error handler (this time, independent > from whether a structural error handler is registered or not). Interesting API, there. > As it stood, the code assumed that all third-party library re-install > their libxml error handlers before each library call, and thus didn't > bother to restore the old error handler itself. Since I revamped the > error handling anyway, I removed that requirement. There is now a > function pg_xml_done() which restores the original error handler that > we overwrote in pg_xml_init(). Seems reasonable. > I also realized that some libxml error (like undefined namespace prefixes) > must be ignored during xmlparse() and friends. Otherwise, it becomes > impossible to build XML documents from individual fragments. pg_xml_init() > therefore now takes an argument which specifies how strict the error > checking is supposed to be. For the moment, only XPATH() uses the strict > mode in which we report all errors. XMLPARSE() and friends only report > parse errors, not namespace errors. Seems fair offhand. We must preserve the invariant that any object with type "xml" can be passed through xml_out() and then be accepted by xml_in(). I'm not seeing any specific risks there, but I figure I'd mention it. I couldn't reconcile this description with the code. I do see that xpath_internal() uses XML_PURPOSE_OTHER, but so does xmlelement(). Why is xmlelement() also ready for the strictest error reporting? Also note that we may be able to be more strict when xmloption = document. > > Finally, I had to adjust contrib/xml2 because it uses some parts of > the core XML support like pg_xml_init(). > > Heres the indended behaviour with the patch applied: > ---------------------------------------------------- > > We always use structured error handling. For now, the error messages > pretty much resemble the old ones, but it's now easy to add additional > information. > > XMLPARSE() and casting to XML check for parse errors only, like they do > without the patch. They're also capable of reporting warnings, but I > didn't find a case where the libxml parser generates a warning. > > XPATH() reports all errors and warnings. Trying to use XPATH() on > a document with e.g. inconsistent namespace usage or invalid > namespace URIs therefore now raises an error. This is *necessary* > because libxml's XPath evaluator gets confused if it encounters > e.g. invalid namespace URI and outputs invalid XML in response. Ideally, an invalid namespace URI should always be an error. While namespace prefixes could become valid as you assemble a larger document, a namespace URI's validity is context-free. > contrib/xml2's behaviour hasn't changed. > > Patch is attached, and comments are welcome. As far as the code itself, a few comments: The patch adds some trailing whitespace. > *** a/contrib/xml2/xpath.c > --- b/contrib/xml2/xpath.c > *************** xpath_table(PG_FUNCTION_ARGS) > *** 720,725 **** > --- 726,732 ---- > if (comppath == NULL) > { > xmlFreeDoc(doctree); > + pg_xml_done(); > xml_ereport(ERROR, > ERRCODE_EXTERNAL_ROUTINE_EXCEPTION, > "XPath > Syntax Error"); > } Remembering to put a pg_xml_done() in every relevant elog(ERROR, ...) path seems fragile. How about using RegisterXactCallback/RegisterSubXactCallback in pgxml_parser_init() to handle those cases? You can also use it to Assert at non-abort transaction shutdown that pg_xml_done() was called as needed. > *************** xmlelement(XmlExprState *xmlExpr, ExprCo > *** 597,614 **** > } > > /* now safe to run libxml */ > ! pg_xml_init(); > > PG_TRY(); > { > buf = xmlBufferCreate(); > ! if (!buf) > ! xml_ereport(ERROR, ERRCODE_OUT_OF_MEMORY, > ! "could not allocate xmlBuffer"); > writer = xmlNewTextWriterMemory(buf, 0); > ! if (!writer) > ! xml_ereport(ERROR, ERRCODE_OUT_OF_MEMORY, > ! "could not allocate > xmlTextWriter"); > > xmlTextWriterStartElement(writer, (xmlChar *) xexpr->name); > > --- 607,622 ---- > } > > /* now safe to run libxml */ > ! pg_xml_init(XML_PURPOSE_OTHER); > > PG_TRY(); > { > buf = xmlBufferCreate(); > ! XML_REPORT_ERROR(buf != NULL, ERROR, ERRCODE_OUT_OF_MEMORY, > ! "could not allocate > xmlBuffer"); > writer = xmlNewTextWriterMemory(buf, 0); > ! XML_REPORT_ERROR(writer != NULL, ERROR, ERRCODE_OUT_OF_MEMORY, > ! "could not allocate > xmlTextWriter"); > > xmlTextWriterStartElement(writer, (xmlChar *) xexpr->name); > Consider renaming XML_REPORT_ERROR it to something like XML_CHECK_ERROR, emphasizing that it wraps a conditional. > *************** xml_parse(text *data, XmlOptionType xmlo > *** 1175,1182 **** > int32 len; > xmlChar *string; > xmlChar *utf8string; > ! xmlParserCtxtPtr ctxt; > ! xmlDocPtr doc; > > len = VARSIZE(data) - VARHDRSZ; /* will be useful later */ > string = xml_text2xmlChar(data); > --- 1244,1251 ---- > int32 len; > xmlChar *string; > xmlChar *utf8string; > ! xmlParserCtxtPtr ctxt = NULL; > ! xmlDocPtr doc = NULL; > > len = VARSIZE(data) - VARHDRSZ; /* will be useful later */ > string = xml_text2xmlChar(data); Is this change needed? > *************** void > *** 1337,1366 **** > xml_ereport(int level, int sqlcode, const char *msg) > { > char *detail; > ! > /* > ! * It might seem that we should just pass xml_err_buf->data directly to > ! * errdetail. However, we want to clean out xml_err_buf before throwing > * error, in case there is another function using libxml further down > the > * call stack. > */ > ! if (xml_err_buf->len > 0) > { > ! detail = pstrdup(xml_err_buf->data); > ! resetStringInfo(xml_err_buf); > } > else > detail = NULL; > > if (detail) > { > - size_t len; > - > - /* libxml error messages end in '\n'; get rid of it */ > - len = strlen(detail); > - if (len > 0 && detail[len - 1] == '\n') > - detail[len - 1] = '\0'; > - > ereport(level, > (errcode(sqlcode), > errmsg("%s", msg), > --- 1408,1430 ---- > xml_ereport(int level, int sqlcode, const char *msg) > { > char *detail; > ! > /* > ! * It might seem that we should just pass xml_error_detail_buf->data > directly to > ! * errdetail. However, we want to clean out xml_error_detail_buf > before throwing > * error, in case there is another function using libxml further down > the > * call stack. > */ > ! if (xml_error_detail_buf->len > 0) > { > ! detail = pstrdup(xml_error_detail_buf->data); > ! resetStringInfo(xml_error_detail_buf); The xml_err_buf -> xml_error_detail_buf change should be its own patch, if you think it's important. Changing the name at the same time makes it harder to verify this patch. > } > else > detail = NULL; > > if (detail) > { > ereport(level, > (errcode(sqlcode), > errmsg("%s", msg), > *************** xml_ereport(int level, int sqlcode, cons > *** 1376,1402 **** > > > /* > ! * Error handler for libxml error messages > */ > static void > ! xml_errorHandler(void *ctxt, const char *msg,...) > { > ! /* Append the formatted text to xml_err_buf */ > ! for (;;) > { > ! va_list args; > ! bool success; > > - /* Try to format the data. */ > - va_start(args, msg); > - success = appendStringInfoVA(xml_err_buf, msg, args); > - va_end(args); > > ! if (success) > break; > ! > ! /* Double the buffer size and try again. */ > ! enlargeStringInfo(xml_err_buf, xml_err_buf->maxlen); > } > } > > --- 1440,1566 ---- > > > /* > ! * Append a newline after removing all existing trailing newlines > */ > static void > ! appendStringInfoLineSeparator(StringInfo str) > { > ! chopStringInfoNewlines(str); > ! > ! if (str->len > 0) > ! appendStringInfoChar(str, '\n'); > ! } > ! > ! > ! /* > ! * Remove all trailing newlines > ! */ > ! static void > ! chopStringInfoNewlines(StringInfo str) > ! { > ! while ((str->len > 0) && > ! (str->data[str->len-1] == '\n')) > { > ! str->data[--str->len] = '\0'; > ! } > ! } Is there something about this patch's other changes that make it necessary to strip multiple newlines, vs. one newline as we did before, and to do it in a different place in the code? > > > ! /* > ! * Error handler for libxml errors and warnings > ! */ > ! static void > ! xml_errorHandler(void* data, xmlErrorPtr error) > ! { > ! /* Buffer to hold the error detail */ > ! StringInfo errorBuf = makeStringInfo(); > ! > ! /* Get parser and input context */ > ! xmlParserCtxtPtr ctxt = (xmlParserCtxtPtr) error->ctxt; > ! xmlParserInputPtr input = (ctxt != NULL) ? ctxt->input : NULL; > ! xmlNodePtr node = error->node; > ! const xmlChar* name = ((node != NULL) && (node->type == > XML_ELEMENT_NODE)) ? > ! node->name : NULL; > ! > ! switch (error->domain) { > ! case XML_FROM_NONE: > ! case XML_FROM_PARSER: > ! case XML_FROM_MEMORY: > ! case XML_FROM_IO: > ! /* Accept regardless of the parsing purpose */ > ! break; > ! > ! default: > ! /* Ignore during well-formedness check */ > ! if (xml_purpose == XML_PURPOSE_WELLFORMED) > ! return; > break; > ! } > ! > ! /* Append error message to xml_error_detail_buf */ > ! if ((error->domain == XML_FROM_PARSER) && (error->line > 0)) > ! appendStringInfo(errorBuf, "line %d: ", error->line); > ! if (name != NULL) > ! appendStringInfo(errorBuf, "element %s: ", name); > ! appendStringInfo(errorBuf, "%s", error->message); > ! > ! /* Append context information to xml_error_detail_buf. > ! * xmlParserPrintFileContext() uses the *generic* error handle to > ! * write the context. Since we don't want to duplicate libxml > ! * functionality here, we setup a generic error handler temporarily > ! */ > ! if (input != NULL) > ! { > ! /* Save generic error func and setup the xml_error_detail_buf > ! * appender as generic error func. This should work because > ! * appendStringInfo() has essentially the same signature as > ! * xmlGenericErrorFunc(). > ! */ > ! xmlGenericErrorFunc errFuncSaved = xmlGenericError; > ! void* errCtxSaved = xmlGenericErrorContext; > ! xmlSetGenericErrorFunc(errorBuf, > (xmlGenericErrorFunc)appendStringInfo); > ! > ! /* Generic context information */ > ! appendStringInfoLineSeparator(errorBuf); > ! xmlParserPrintFileContext(input); > ! > ! /* Restore generic error func */ > ! xmlSetGenericErrorFunc(errCtxSaved, errFuncSaved); > ! } > ! > ! chopStringInfoNewlines(errorBuf); > ! > ! /* Legacy error handling. The error flag is never set. Exists because > ! * the xml2 contrib module uses our error-handling infrastructure, but > ! * we don't want to change its behaviour since it's deprecated anyway > ! */ > ! if (xml_purpose == XML_PURPOSE_LEGACY) > ! { > ! appendStringInfoLineSeparator(xml_error_detail_buf); > ! appendStringInfoString(xml_error_detail_buf, errorBuf->data); > ! return; > ! } > ! > ! /* We don't want to ereport() here because that'd probably leave > ! * libxml in an inconsistent state. Instead, we remember the > ! * error and ereport() from xml_ereport(). > ! * > ! * Warnings and notices are reported immediatly since they don't cause a typo > ! * longjmp() out of libxml. > ! */ > ! if (error->level >= XML_ERR_ERROR) > ! { > ! appendStringInfoLineSeparator(xml_error_detail_buf); > ! appendStringInfoString(xml_error_detail_buf, errorBuf->data); > > ! xml_error_occured = true; > ! } > ! else if (error->level >= XML_ERR_WARNING) > ! { > ! ereport(WARNING, (errmsg("%s", errorBuf->data))); > ! } > ! else > ! { > ! ereport(NOTICE, (errmsg("%s", errorBuf->data))); > } > } > > diff --git a/src/include/utils/xml.h b/src/include/utils/xml.h > index 6359cd6..6735521 100644 > *** a/src/include/utils/xml.h > --- b/src/include/utils/xml.h > *************** typedef enum > *** 68,74 **** > XML_STANDALONE_OMITTED > } XmlStandaloneType; > > ! extern void pg_xml_init(void); > extern void xml_ereport(int level, int sqlcode, const char *msg); > extern xmltype *xmlconcat(List *args); > extern xmltype *xmlelement(XmlExprState *xmlExpr, ExprContext *econtext); > --- 68,93 ---- > XML_STANDALONE_OMITTED > } XmlStandaloneType; > > ! typedef enum > ! { > ! XML_PURPOSE_NONE /* Don't setup error handler. pg_xml_done() not > required */, It's odd to have a purpose of "NONE" that pg_xml_init() callers can actually use. How about XML_PURPOSE_TRIVIAL, for callers that only call libxml2 functions that can't error? Also, I think it would be better to require a call to pg_xml_done() in all cases, just to keep things simple. Actually, I wonder if it's worth having XML_PURPOSE_TRIVIAL for the one user thereof. It doesn't save much. > ! XML_PURPOSE_LEGACY /* Save error message only, don't set error flag */, It's fine to set the flag for legacy users, considering they could just not read it. The important distinction seems to be that we don't emit warnings or notices in this case. Is that correct? If so, the comment should reflect that emphasis. Then, consider updating the flag unconditionally. > ! XML_PURPOSE_WELLFORMED /* Ignore non-parser errors, nothing else*/, > ! XML_PURPOSE_OTHER /* Report all errors */ > ! } XmlPurposeType; This is currently about error reporting only. How about: typedef enum { XML_ER_NONE, /* no calls to libxml2 functions permitted */ XML_ER_WELLFORMED, /* report errors and warnings from the parser only */ XML_ER_LEGACY, /* report all errors, no warnings */ XML_ER_ALL /* report all errors and warnings */ } XmlErrorReporting; > ! > ! extern bool xml_error_occured; > ! > ! #define XML_REPORT_ERROR(assertion,level,sqlcode,msg) \ > ! if (pg_xml_erroroccurred() || !(assertion)) { \ > ! xml_ereport(level, sqlcode, msg); \ > ! } \ > ! else { \ > ! } Consider moving this definition into xml.c. It's more of an internal shorthand for keeping that file simple than part of the API. > ! > ! extern void pg_xml_init(XmlPurposeType purpose); > ! extern void pg_xml_done(void); > ! extern bool pg_xml_erroroccurred(void); > extern void xml_ereport(int level, int sqlcode, const char *msg); > extern xmltype *xmlconcat(List *args); > extern xmltype *xmlelement(XmlExprState *xmlExpr, ExprContext *econtext); > diff --git a/src/test/regress/expected/xml.out > b/src/test/regress/expected/xml.out > index eaa5a74..da2f290 100644 > *** a/src/test/regress/expected/xml.out > --- b/src/test/regress/expected/xml.out > *************** INSERT INTO xmltest VALUES (3, '<wrong') > *** 8,14 **** > ERROR: invalid XML content > LINE 1: INSERT INTO xmltest VALUES (3, '<wrong'); > ^ > ! DETAIL: Entity: line 1: parser error : Couldn't find end of Start Tag > wrong line 1 > <wrong > ^ > SELECT * FROM xmltest; > --- 8,14 ---- > ERROR: invalid XML content > LINE 1: INSERT INTO xmltest VALUES (3, '<wrong'); > ^ > ! DETAIL: line 1: Couldn't find end of Start Tag wrong line 1 Any reason to change the error text this way? Thanks, nm
*** /home/nmisch/src/cvs/postgresql/src/test/regress/expected/xml.out 2011-06-20 00:51:56.000000000 -0400 --- /home/nmisch/src/cvs/postgresql/src/test/regress/results/xml.out 2011-06-20 01:19:20.000000000 -0400 *************** *** 223,234 **** --- 223,240 ---- <undefinedentity>&idontexist;</undefinedentity> ^ SELECT xmlparse(content '<invalidns xmlns=''<''/>'); + WARNING: line 1: xmlns: < not a valid URI + <invalidns xmlns='<'/> + ^ xmlparse --------------------------- <invalidns xmlns='<'/> (1 row) SELECT xmlparse(content '<relativens xmlns=''relative''/>'); + WARNING: line 1: xmlns: URI relative is not absolute + <relativens xmlns='relative'/> + ^ xmlparse -------------------------------- <relativens xmlns='relative'/> *************** *** 279,290 **** --- 285,302 ---- <undefinedentity>&idontexist;</abc> ^ SELECT xmlparse(document '<invalidns xmlns=''<''/>'); + WARNING: line 1: xmlns: < not a valid URI + <invalidns xmlns='<'/> + ^ xmlparse --------------------------- <invalidns xmlns='<'/> (1 row) SELECT xmlparse(document '<relativens xmlns=''relative''/>'); + WARNING: line 1: xmlns: URI relative is not absolute + <relativens xmlns='relative'/> + ^ xmlparse -------------------------------- <relativens xmlns='relative'/> *************** *** 779,790 **** --- 791,808 ---- (1 row) SELECT xml_is_well_formed('<invalidns xmlns=''<''/>'); + WARNING: line 1: xmlns: < not a valid URI + <invalidns xmlns='<'/> + ^ xml_is_well_formed -------------------- t (1 row) SELECT xml_is_well_formed('<relativens xmlns=''relative''/>'); + WARNING: line 1: xmlns: URI relative is not absolute + <relativens xmlns='relative'/> + ^ xml_is_well_formed -------------------- t *************** *** 812,822 **** -- which is invalid beecause '<' may not appear un-escaped in -- attribute values. SELECT xpath('/*', '<invalidns xmlns=''<''/>'); ! ERROR: could not parse XML document ! DETAIL: xmlns: '<' is not a valid URI <invalidns xmlns='<'/> ^ CONTEXT: SQL function "xpath" statement 1 -- Again, the XML isn't well-formed for namespace purposes SELECT xpath('/*', '<nosuchprefix:tag/>'); ERROR: could not parse XML document --- 830,849 ---- -- which is invalid beecause '<' may not appear un-escaped in -- attribute values. SELECT xpath('/*', '<invalidns xmlns=''<''/>'); ! WARNING: line 1: xmlns: < not a valid URI ! <invalidns xmlns='<'/> ! ^ ! LINE 1: SELECT xpath('/*', '<invalidns xmlns=''<''/>'); ! ^ ! WARNING: line 1: xmlns: < not a valid URI <invalidns xmlns='<'/> ^ CONTEXT: SQL function "xpath" statement 1 + xpath + ------------------------------ + {"<invalidns xmlns=\"<\"/>"} + (1 row) + -- Again, the XML isn't well-formed for namespace purposes SELECT xpath('/*', '<nosuchprefix:tag/>'); ERROR: could not parse XML document *************** *** 827,833 **** -- XPath deprecates relative namespaces, but they're not supposed to -- thrown an error, only a warning. SELECT xpath('/*', '<relativens xmlns=''relative''/>'); ! WARNING: xmlns: URI relative is not absolute <relativens xmlns='relative'/> ^ CONTEXT: SQL function "xpath" statement 1 --- 854,865 ---- -- XPath deprecates relative namespaces, but they're not supposed to -- thrown an error, only a warning. SELECT xpath('/*', '<relativens xmlns=''relative''/>'); ! WARNING: line 1: xmlns: URI relative is not absolute ! <relativens xmlns='relative'/> ! ^ ! LINE 1: SELECT xpath('/*', '<relativens xmlns=''relative''/>'); ! ^ ! WARNING: line 1: xmlns: URI relative is not absolute <relativens xmlns='relative'/> ^ CONTEXT: SQL function "xpath" statement 1 ======================================================================
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers