Hello libxml2 maintainers,

Short version: note the xmlURI struct from uri.h (libxml2 version 2.9.2):

/**
 * xmlURI:
 *
 * A parsed URI reference. This is a struct containing the various fields
 * as described in RFC 2396 but separated for further processing.
 *
 * Note: query is a deprecated field which is incorrectly unescaped.
 * query_raw takes precedence over query if the former is set.
 * See: http://mail.gnome.org/archives/xml/2007-April/thread.html#00127
 */
typedef struct _xmlURI xmlURI;
typedef xmlURI *xmlURIPtr;
struct _xmlURI {
    char *scheme;       /* the URI scheme */
    char *opaque;       /* opaque part */
    char *authority;    /* the authority part */
    char *server;       /* the server part */ 
    char *user;         /* the user part */
    int port;           /* the port number */
    char *path;         /* the path string */
    char *query;        /* the query string (deprecated - use with caution) */
    char *fragment;     /* the fragment identifier */
    int  cleanup;       /* parsing potentially unclean URI */
    char *query_raw;    /* the query string (as it appears in the URI) */
};  

In addition to 'query_raw', it would be useful to have 'server_raw',
'user_raw', 'path_raw' and 'fragment_raw' members that take precedence over
the existing (decoded) struct members.

===

Long version:

We use libxml2/libxslt for server-side XSLT processing of browser pages. To
allow XSLT stylesheets from other domains we use a proxy that is supplied with
the original URL in encoded form. A demo example:

<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/xsl" 
href="/get/BASE=http%3A%2F%2Fservername%3A80%2F%7Eaccountname%2Fdirectory/%3Fid%3DSCREEN_ID%26name%3Dvalue"?>

This external stylesheet is loaded using xsltLoadStylesheetPI(). Beforehand
we've called xsltSetLoaderFunc() to have control over the documents that are
loaded during the transformation, which are the stylesheet itself as well as
sub-documents. The problem is that the function set by xsltSetLoaderFunc()
receives mangled URLs. E.g. the above URL is transformed to:

http://<ip-address>:<port>/get/BASE=http%3A//servername%3A80/~accountname/directory/%3Fid=SCREEN_ID&name=value

This cannot be repaired outside the library because we cannot know which
parts to URL-encode to get back the original URL. Note that in this example
"%3A" and "%3F" are still intact. URL-encoding the whole string would result in
double encoding of these parts. It would also encode all forward slashes '/'
instead of only those that were decoded from "%2F".
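
To illustrate why the decoding cannot be undone, here is a minimal sketch in
plain C. demo_unescape() is a hypothetical helper written for this mail, not a
libxml2 function, and it only approximates what a percent-decoder does. It
shows that an encoded "%2F" and a literal '/' decode to the same byte, so the
decoded string no longer tells us which slashes to re-encode:

```c
#include <stdlib.h>
#include <string.h>

/* Value of a hex digit, or -1 if c is not a hex digit. */
static int hex_value(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/*
 * demo_unescape (hypothetical, for illustration only):
 * percent-decode src into a newly allocated string.
 * Returns NULL on allocation failure; the caller frees the result.
 */
static char *demo_unescape(const char *src)
{
    char *out = malloc(strlen(src) + 1);
    char *dst = out;

    if (out == NULL)
        return NULL;
    while (*src != '\0') {
        if (src[0] == '%') {
            int hi = hex_value((unsigned char) src[1]);
            /* Only look at src[2] when src[1] was a valid hex digit. */
            int lo = (hi >= 0) ? hex_value((unsigned char) src[2]) : -1;
            if (lo >= 0) {
                *dst++ = (char) (hi * 16 + lo);
                src += 3;
                continue;
            }
        }
        *dst++ = *src++;    /* copy ordinary characters unchanged */
    }
    *dst = '\0';
    return out;
}
```

With this helper, "directory%2F%3Fid" and "directory/%3Fid" both decode to
"directory/?id". Two different inputs map to one output, so no re-encoding
step applied afterwards can recover the original.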

A closer look reveals what goes wrong. xmlBuildURI() indirectly calls
xmlURIUnescapeString(), which decodes all percent-encoded sequences. Finally,
xmlSaveUri() constructs the above output string, re-encoding special
characters such as ':' and '?' but not characters like '/' and '&'. Imho, a
better approach would be to skip decoding/encoding entirely and keep the raw
parts, gluing them together before handing them over to the outside. If you
look at this function:

/**
 * xmlParse3986URI:
 * @uri:  pointer to an URI structure
 * @str:  the string to analyze
 *
 * Parse an URI string and fills in the appropriate fields
 * of the @uri structure
 *
 * scheme ":" hier-part [ "?" query ] [ "#" fragment ]
 *
 * Returns 0 or the error code
 */

then it would make sense to divide the input by ":", "?" and "#" and save all 
parts in raw format. When constructing a url, xmlSaveUri() can simply glue all 
parts together with ":", "?" and "#" in between. But I only see query_raw 
stored in the xmlURI struct. What about the other struct members that got their 
value through xmlURIUnescapeString()?
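
As a rough sketch of the raw approach, assuming nothing about libxml2
internals: the demo_* types and functions below are hypothetical names made up
for this mail. The split keeps every component as an untouched slice of the
input, and the join glues the slices back together with ":", "?" and "#". A
real parser would also have to split the hier-part into authority and path.

```c
#include <stddef.h>
#include <string.h>

/* A slice of the input string; ptr == NULL means the component is absent. */
typedef struct {
    const char *ptr;
    size_t len;
} demo_slice;

typedef struct {
    demo_slice scheme, hier, query, fragment;
} demo_raw_uri;

/* Split str into raw components without decoding anything. */
static void demo_split_raw(const char *str, demo_raw_uri *out)
{
    const demo_slice none = { NULL, 0 };
    const char *end = str + strlen(str);
    const char *p;

    out->scheme = out->hier = out->query = out->fragment = none;

    /* Fragment: everything after the first '#'. */
    p = memchr(str, '#', (size_t) (end - str));
    if (p != NULL) {
        out->fragment.ptr = p + 1;
        out->fragment.len = (size_t) (end - (p + 1));
        end = p;
    }
    /* Query: everything after the first '?' that precedes the fragment. */
    p = memchr(str, '?', (size_t) (end - str));
    if (p != NULL) {
        out->query.ptr = p + 1;
        out->query.len = (size_t) (end - (p + 1));
        end = p;
    }
    /* Scheme: non-empty text before the first ':' if no '/' precedes it. */
    p = memchr(str, ':', (size_t) (end - str));
    if (p != NULL && p != str && memchr(str, '/', (size_t) (p - str)) == NULL) {
        out->scheme.ptr = str;
        out->scheme.len = (size_t) (p - str);
        str = p + 1;
    }
    out->hier.ptr = str;
    out->hier.len = (size_t) (end - str);
}

/* Append len bytes at pos if they fit; always return the new position. */
static size_t demo_append(char *buf, size_t size, size_t pos,
                          const char *s, size_t len)
{
    if (pos + len < size)
        memcpy(buf + pos, s, len);
    return pos + len;
}

/* Glue the raw parts back together. Returns 0, or -1 if buf is too small. */
static int demo_join_raw(const demo_raw_uri *u, char *buf, size_t size)
{
    size_t n = 0;

    if (u->scheme.ptr != NULL) {
        n = demo_append(buf, size, n, u->scheme.ptr, u->scheme.len);
        n = demo_append(buf, size, n, ":", 1);
    }
    n = demo_append(buf, size, n, u->hier.ptr, u->hier.len);
    if (u->query.ptr != NULL) {
        n = demo_append(buf, size, n, "?", 1);
        n = demo_append(buf, size, n, u->query.ptr, u->query.len);
    }
    if (u->fragment.ptr != NULL) {
        n = demo_append(buf, size, n, "#", 1);
        n = demo_append(buf, size, n, u->fragment.ptr, u->fragment.len);
    }
    if (n >= size)
        return -1;
    buf[n] = '\0';
    return 0;
}
```

Because nothing is decoded, splitting the href from the example above and
joining it again gives back the identical string, including every "%2F" and
"%3A".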
  
Kind regards,

Martin Zwaal
OCLC B.V. · Software Engineer
Schipholweg 99, P.O. Box 876 2300 AW Leiden The Netherlands
T +31 (0)71 524 678


_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
https://mail.gnome.org/mailman/listinfo/xml
