Hi,

XML is used for many different kinds of files, and having Apache serve every XML file as application/xml unless explicitly told otherwise is suboptimal. Correct use of more specific types is useful for content negotiation, for example: a user agent might have a preference between 'text/vcard' and 'application/vcard+xml'. As with XHTML, which Apache already handles, these files usually contain enough information to identify their correct type. When the top-level element has a designated XML namespace, this can be done without any chance of error: determining a more specific media type for an XML document is then a deterministic procedure, not a matter of guesswork.

I can't find any existing solution for this with Apache as the HTTP server, though, such as a module. Is this just something no one has gotten around to implementing yet (either in the Apache HTTP Server project or on their own)? Has anyone solved this problem before?

Here's some research I've done on the matter.

• It appears there is precedent for using libxml2 to implement functionality in httpd, but the only obvious example is mod_xml2enc https://httpd.apache.org/docs/trunk/mod/mod_xml2enc.html which handles text encodings on the fly as a filter. If libxml2 is already used by some Apache modules, then using it to parse an XML document's root element, namespace, and DOCTYPE declaration ought to be pretty straightforward, as a first step to inform the choice of a superior media type. Do any other parts of the server do anything like this? If not, I'm hopeful a quality implementation of this could be considered for inclusion in the core distribution.

• To heuristically determine media types of files generally, mod_mime_magic https://httpd.apache.org/docs/trunk/mod/mod_mime_magic.html is described as working "like" the file(1) command. Unfortunately, Apache uses its own home-grown implementation for this job ("This module is derived from a free version of the file(1) command for Unix"), and it expects to be used with a MimeMagicFile in the format of the one supplied by Apache. This means that even when improvements are made to the file(1) command and the libmagic library that the majority of libre systems use, they will not trickle down to Apache. Is there a reason for this apparent code duplication? Maybe it comes from a time before the libmagic library https://www.darwinsys.com/file/ existed; curiously, that upstream project is the same one that mod_mime_magic is based on anyway.

• This subject matter is based on the premise that the name of an XML document's root element, along with at least one of a document type declaration or an XML namespace declaration, can uniquely identify an XML document's kind and inform user agents how to use the XML files they fetch. This could be realized through two different approaches, neither of which I've been able to pull off.

  ◦ The IANA has a registration for a media type called application/prs.implied-document+xml https://www.iana.org/assignments/media-types/application/prs.implied-document+xml which allows this concept in general. The comments for the registration express this well:

    > This media type identifies a meta-format that encompasses all XML-based
    > formats which are identified by a particular name of the root element,
    > optionally together with a namespace URI or the PUBLIC identifier stored in
    > the DTD. It is intended for use in applications that describe files using
    > media types, but do not have sufficient heuristics to output a more specific
    > media type. In such a case, the application may parse XML and use the name of
    > the root element and the DTD to the "root", "ns", and "public" parameters.

    It even gives an example: the common image/svg+xml type is approximately equivalent to

    application/prs.implied-document+xml;root=svg;ns="http://www.w3.org/2000/svg";public="-//W3C//DTD SVG 1.1//EN"

    If something somewhere in the pipeline could express a media type like this, then the canonical image/svg+xml could be substituted as an alias somewhere.

  ◦ An orthogonal issue is how we could define such "canonical media types" that correspond to some XML document type. I am pleased to discover that the shared-mime-info database specification, commonly used on GNU/Linux, already provides for this! https://specifications.freedesktop.org/shared-mime-info/latest/ar01s02.html#id-1.3.9 As a matter of fact, on my Debian Trixie system, a file /usr/share/mime/XMLnamespaces already exists. This is a short plain text file with lines such as

    > http://www.abisource.com/awml.dtd abiword application/x-abiword
    > http://www.w3.org/1998/Math/MathML math application/mathml+xml
    > http://www.w3.org/1999/xhtml html application/xhtml+xml

    and so on. So regardless of whether the application/prs.implied-document+xml media type is used somewhere as an internal representation, this is a straightforward mapping that provides everything needed. If only Apache could use it.

If any solutions exist along these lines, I don't know of them yet but would love to. Otherwise, I ask of the sympathetic readership: how do you handle this?
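
To make the idea concrete, here is a minimal sketch of the lookup in Python, using only the standard library's expat bindings. It is an illustration under assumptions, not an Apache module: a real one would presumably be written in C, perhaps against libxml2. The function names are hypothetical, and the XMLnamespaces line format (namespace URI, root element name, and media type separated by single spaces) is inferred from the lines quoted above. The sample table contains only those three quoted lines.

```python
import xml.parsers.expat

# The three lines quoted above from /usr/share/mime/XMLnamespaces.
SAMPLE_XMLNAMESPACES = """\
http://www.abisource.com/awml.dtd abiword application/x-abiword
http://www.w3.org/1998/Math/MathML math application/mathml+xml
http://www.w3.org/1999/xhtml html application/xhtml+xml
"""


class _RootFound(Exception):
    """Raised internally to stop parsing once the root element is seen."""


def load_namespace_map(text):
    """Parse 'namespace localname mediatype' lines (assumed format) into a dict."""
    mapping = {}
    for line in text.splitlines():
        parts = line.split(" ")
        if len(parts) != 3:
            continue  # skip anything that does not match the assumed format
        namespace, local_name, media_type = parts
        mapping[(namespace, local_name)] = media_type
    return mapping


def sniff_root(data):
    """Return (namespace, local_name, public_id) for the document's root element."""
    found = {"ns": "", "local": None, "public": None}
    # With a namespace separator, expat reports element names as "uri local".
    parser = xml.parsers.expat.ParserCreate(namespace_separator=" ")

    def on_doctype(name, system_id, public_id, has_internal_subset):
        found["public"] = public_id

    def on_start(name, attrs):
        namespace, _, local_name = name.rpartition(" ")
        found["ns"], found["local"] = namespace, local_name
        raise _RootFound  # nothing after the root start tag is needed

    parser.StartDoctypeDeclHandler = on_doctype
    parser.StartElementHandler = on_start
    try:
        parser.Parse(data, True)
    except _RootFound:
        pass
    return found["ns"], found["local"], found["public"]


def guess_media_type(data, mapping, default="application/xml"):
    """Look up the root element in the mapping; fall back to the generic type."""
    namespace, local_name, _public_id = sniff_root(data)
    return mapping.get((namespace, local_name), default)


def implied_document_type(data):
    """Build an application/prs.implied-document+xml string from the root element."""
    namespace, local_name, public_id = sniff_root(data)
    params = [f"root={local_name}"]
    if namespace:
        params.append(f'ns="{namespace}"')
    if public_id:
        params.append(f'public="{public_id}"')
    return "application/prs.implied-document+xml;" + ";".join(params)
```

Calling guess_media_type() with a document's bytes and the loaded table yields the specific type when the root element is listed, and application/xml otherwise. Parsing stops at the root start tag, so the cost does not grow with document size. Malformed input before the root element raises expat's own error, which a server would need to catch and treat as "unknown".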
