Hannu Krosing wrote: > > In that case, treat the XML document like a binary stream, using > > PQescapeBytea, etc. to encode if necessary pre-query. Also, the XML > > domain should inherit from bytea, not varchar. > > why ? > > the allowed characters repertoire in XML is even less than in varchar.
Yes, that is correct. I was resorting to hyperbole...see my reasoning below. > > The document should be stored bit for bit as was submitted. > > Or in some pre-parsed form which allows restoration of submitted form, > which could be more for things like xpath queries or subtree extraction. This is the crucial point: I'll try and explain my thinking better. > > OTOH, if we are transforming the document down to a more generic format > > (either canonical or otherwise), then the xml could be dealt with like > > text in the ususal way. Of course, then we are not really storing xml, > > more like 'meta' xml ;) > > On the contrary! If there is DTD or Schema or other structure definition > for XML, then we know which whitespace is significant and can do > whatever we like with insignificant whitespace. According to the XML standard, whitespace is always significant unless it is outside an element or attribute and thus not part of the real data. A DTD or Schema adds constraints, not removes them. I'm nitpicking, but this is extra evidence to my philosophy of xml storage that I'll explain below. > select > '<d/>'::xml == '<?xml version="1.0" encoding="utf-8"?>\n<d/>\n'::xml Right: I understand your reasoning here. Here is the trick: select '[...]'::xml introduces a casting step which justifies a transformation. The original input data is not xml, but varchar. Since there are no arbitrary rules on how to do this, we have some flexibility here to do things like change the encoding/mess with the whitespace. I am trying to find away to break the assumption that my xml data necessarily has to be converted from raw text. My basic point is that we are confusing the roles of storing and parsing/transformation. The question is: are we storing xml documents or the metadata that makes up xml documents? We need to be absolutely clear on which role the server takes on...in fact both roles may be appropriate for different situations, but should be represented by a different type. I'll try and give examples of both situations. If we are strictly storing documents, IMO the server should perform zero modification on the document. Validation could be applied conceptually as a constraint (and, possibly XSLT/XPATH to allow a fancy type of indexing). However there is no advantage that I can see to manipulating the document except to break the 'C' of ACID. My earlier comments wrt binary encoding is that there simply has to be a way to prevent the server mucking with my document. For example, if I was using postgres to store XML-EDI documents in a DX system this is the role I would prefer. Validation and indexing are useful, but my expected use of the server is a type of electronic xerox of the incoming document. I would be highly suspicious of any modification the server made to my document for any reason. Now, if we are storing xml as content, say for a content database backing a web page, the server takes on the role of a meta-data storage system. Now, it is reasonable to assume the server might do additional processing besides storage and validation. The character encoding of the incoming data is of little or no importance because the xml will almost certainly undergo an additional transformation step after extraction from the database. Flexibility, simplicity, and utility are the most important requirements, so text transformation to a default encoding would be quite useful. Based on your suggestions I think you are primarily concerned with the second example. However, in my work I do a lot of DX and I see the xml document as a binary object. Server-side validation would be extremely helpful, but please don't change my document! So, I submit that we are both right for different reasons. Regards, Merlin ---------------------------(end of broadcast)--------------------------- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match