Re: [HACKERS] Encoding problems in PostgreSQL with XML data

Merlin Moncure Thu, 15 Jan 2004 09:50:54 -0800

Hannu Krosing wrote:

> > In that case, treat the XML document like a binary stream, using
> > PQescapeBytea, etc. to encode if necessary pre-query.  Also, the XML
> > domain should inherit from bytea, not varchar.
> 
> why ?
> 
> the allowed characters repertoire in XML is even less than in varchar.


Yes, that is correct.  I was resorting to hyperbole...see my reasoning
below.

> > The document should be stored bit for bit as was submitted.
> 
> Or in some pre-parsed form which allows restoration of submitted form,
> which could be more for things like xpath queries or subtree
extraction.

This is the crucial point:  I'll try and explain my thinking better.

> > OTOH, if we are transforming the document down to a more generic
format
> > (either canonical or otherwise), then the xml could be dealt with
like
> > text in the ususal way.  Of course, then we are not really storing
xml,
> > more like 'meta' xml ;)
> 
> On the contrary! If there is DTD or Schema or other structure
definition
> for XML, then we know which whitespace is significant and can do
> whatever we like with insignificant whitespace.

According to the XML standard, whitespace is always significant unless
it is outside an element or attribute and thus not part of the real
data.  A DTD or Schema adds constraints, not removes them.  I'm
nitpicking, but this is extra evidence to my philosophy of xml storage
that I'll explain below.

> select
> '<d/>'::xml == '<?xml version="1.0" encoding="utf-8"?>\n<d/>\n'::xml

Right: I understand your reasoning here.  Here is the trick:

select '[...]'::xml introduces a casting step which justifies a
transformation.  The original input data is not xml, but varchar.  Since
there are no arbitrary rules on how to do this, we have some flexibility
here to do things like change the encoding/mess with the whitespace.  I
am trying to find away to break the assumption that my xml data
necessarily has to be converted from raw text.

My basic point is that we are confusing the roles of storing and
parsing/transformation.  The question is: are we storing xml documents
or the metadata that makes up xml documents?  We need to be absolutely
clear on which role the server takes on...in fact both roles may be
appropriate for different situations, but should be represented by a
different type.  I'll try and give examples of both situations.

If we are strictly storing documents, IMO the server should perform zero
modification on the document.  Validation could be applied conceptually
as a constraint (and, possibly XSLT/XPATH to allow a fancy type of
indexing).  However there is no advantage that I can see to manipulating
the document except to break the 'C' of ACID.  My earlier comments wrt
binary encoding is that there simply has to be a way to prevent the
server mucking with my document.

For example, if I was using postgres to store XML-EDI documents in a DX
system this is the role I would prefer.  Validation and indexing are
useful, but my expected use of the server is a type of electronic xerox
of the incoming document.  I would be highly suspicious of any
modification the server made to my document for any reason.  

Now, if we are storing xml as content, say for a content database
backing a web page, the server takes on the role of a meta-data storage
system.  Now, it is reasonable to assume the server might do additional
processing besides storage and validation.  The character encoding of
the incoming data is of little or no importance because the xml will
almost certainly undergo an additional transformation step after
extraction from the database.  Flexibility, simplicity, and utility are
the most important requirements, so text transformation to a default
encoding would be quite useful.

Based on your suggestions I think you are primarily concerned with the
second example.  However, in my work I do a lot of DX and I see the xml
document as a binary object.  Server-side validation would be extremely
helpful, but please don't change my document!

So, I submit that we are both right for different reasons.

Regards,
Merlin


---------------------------(end of broadcast)---------------------------
TIP 9: the planner will ignore your desire to choose an index scan if your
      joining column's datatypes do not match

Re: [HACKERS] Encoding problems in PostgreSQL with XML data

Reply via email to