Hello libxml2 maintainers,

Short version: note the xmlURI struct from uri.h, libxml2 version 2.9.2:

/**
 * xmlURI:
 *
 * A parsed URI reference. This is a struct containing the various fields
 * as described in RFC 2396 but separated for further processing.
 *
 * Note: query is a deprecated field which is incorrectly unescaped.
 * query_raw takes precedence over query if the former is set.
 * See: http://mail.gnome.org/archives/xml/2007-April/thread.html#00127
 */
typedef struct _xmlURI xmlURI;
typedef xmlURI *xmlURIPtr;
struct _xmlURI {
    char *scheme;     /* the URI scheme */
    char *opaque;     /* opaque part */
    char *authority;  /* the authority part */
    char *server;     /* the server part */
    char *user;       /* the user part */
    int port;         /* the port number */
    char *path;       /* the path string */
    char *query;      /* the query string (deprecated - use with caution) */
    char *fragment;   /* the fragment identifier */
    int cleanup;      /* parsing potentially unclean URI */
    char *query_raw;  /* the query string (as it appears in the URI) */
};

Next to 'query_raw' it would be useful to have 'server_raw', 'user_raw',
'path_raw' and 'fragment_raw' that take precedence over the existing struct
members.

===

Long version:

We use libxml2/libxslt for server-side XSLT processing of browser pages. To
allow XSLT stylesheets from other domains we use a proxy that is supplied with
the original URL in encoded form. An example (demo) is this:

<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/xsl" href="/get/BASE=http%3A%2F%2Fservername%3A80%2F%7Eaccountname%2Fdirectory/%3Fid%3DSCREEN_ID%26name%3Dvalue"?>

This external stylesheet is loaded using xsltLoadStylesheetPI(). Beforehand
we call xsltSetLoaderFunc() to have control over the documents that are
loaded during the transformation: the stylesheet itself as well as any
sub-documents.

The problem is that the function set by xsltSetLoaderFunc() receives mangled
URLs. For example, the above URL is transformed to:

http://<ip-address>:<port>/get/BASE=http%3A//servername%3A80/~accountname/directory/%3Fid=SCREEN_ID&name=value

This cannot be repaired outside the library, because we cannot know which
parts to URL-encode to get back the original URL. Note that in this example
"%3A" and "%3F" are still intact. URL-encoding the whole string would
double-encode these parts. It would also encode all forward slashes '/'
instead of only those that were decoded from "%2F".

A closer look reveals what goes wrong. xmlBuildURI() indirectly calls
xmlURIUnescapeString(), which URL-decodes all percent-encoded sequences, and
finally xmlSaveUri() constructs the above output string while URL-encoding the
special characters ':' and '?', but not characters like '/' and '&'.

IMHO, a better approach would be to skip decoding/encoding entirely and use
raw parts that are glued together before handing them over to the outside. If
you look at this function:

/**
 * xmlParse3986URI:
 * @uri: pointer to an URI structure
 * @str: the string to analyze
 *
 * Parse an URI string and fills in the appropriate fields
 * of the @uri structure
 *
 * scheme ":" hier-part [ "?" query ] [ "#" fragment ]
 *
 * Returns 0 or the error code
 */

then it would make sense to divide the input at ":", "?" and "#" and save all
parts in raw format. When constructing a URL, xmlSaveUri() can simply glue all
parts together with ":", "?" and "#" in between. But I only see query_raw
stored in the xmlURI struct. What about the other struct members that got
their value through xmlURIUnescapeString()?

Kind regards,

Martin Zwaal
OCLC B.V. · Software Engineer
Schipholweg 99, P.O. Box 876
2300 AW Leiden
The Netherlands
T +31 (0)71 524 678

_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
https://mail.gnome.org/mailman/listinfo/xml
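
P.S. To make the raw-parts idea concrete, here is a minimal sketch in plain C. It is not libxml2 code: the type raw_uri and the functions raw_uri_split() and raw_uri_join() are hypothetical names made up for this mail, and the splitting is deliberately simplistic (it only handles the top-level scheme ":" hier-part [ "?" query ] [ "#" fragment ] shape, without breaking the hier-part into server, user and path). The point is only that if nothing is unescaped while parsing, gluing the parts back together reproduces the input.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical type, not part of libxml2: each part is a slice of the
 * original string, stored exactly as it appeared (never unescaped). */
typedef struct {
    const char *scheme;   size_t scheme_len;   int has_scheme;
    const char *hier;     size_t hier_len;
    const char *query;    size_t query_len;    int has_query;
    const char *fragment; size_t fragment_len; int has_fragment;
} raw_uri;

/* Split s at the first raw '#', the first raw '?' before it, and a
 * leading "scheme:" if a ':' occurs before any '/', '?' or '#'. */
static void raw_uri_split(const char *s, raw_uri *out)
{
    assert(s != NULL && out != NULL);
    memset(out, 0, sizeof *out);
    size_t end = strlen(s);

    const char *hash = strchr(s, '#');
    if (hash != NULL) {
        out->fragment = hash + 1;
        out->fragment_len = strlen(hash + 1);
        out->has_fragment = 1;
        end = (size_t)(hash - s);
    }

    const char *qmark = memchr(s, '?', end);
    if (qmark != NULL) {
        out->query = qmark + 1;
        out->query_len = (size_t)((s + end) - (qmark + 1));
        out->has_query = 1;
        end = (size_t)(qmark - s);
    }

    const char *p = s;
    size_t n = strcspn(s, ":/?#");
    if (n > 0 && s[n] == ':') {
        out->scheme = s;
        out->scheme_len = n;
        out->has_scheme = 1;
        p = s + n + 1;
    }
    out->hier = p;
    out->hier_len = (size_t)((s + end) - p);
}

/* Glue the raw parts back together with ":", "?" and "#" in between.
 * Returns what snprintf returns (the length the full string needs). */
static int raw_uri_join(const raw_uri *u, char *buf, size_t cap)
{
    return snprintf(buf, cap, "%.*s%s%.*s%s%.*s%s%.*s",
                    (int)u->scheme_len, u->has_scheme ? u->scheme : "",
                    u->has_scheme ? ":" : "",
                    (int)u->hier_len, u->hier,
                    u->has_query ? "?" : "",
                    (int)u->query_len, u->has_query ? u->query : "",
                    u->has_fragment ? "#" : "",
                    (int)u->fragment_len, u->has_fragment ? u->fragment : "");
}
```

For the href from the demo stylesheet, calling raw_uri_split() followed by raw_uri_join() gives back the byte-identical string, because "%2F", "%3D" and "%26" are never decoded along the way. Resolving a relative reference against a base would still need extra logic on top of this; the sketch only covers the split-and-glue step.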