The following bug has been logged on the website: Bug reference: 8469 Logged by: Dennis Email address: dennis.noord...@helsinki.fi PostgreSQL version: 9.3.0 Operating system: FreeBSD 9.2-RC4 Description:
Hi, After upgrading an 8.1 version to 9.3.0 I am suddenly seeing text fields containing "&" where they are populated from XML. This may be a coincidence and the problem may have existed earlier, in any case, now I noticed. I extract the text content of XML nodes using xpath, from something like: <name>Jones & Smith</name> The reason I end up with "&" is the IMHO rather odd xpath behaviour: # select xpath('/a/text()', (select xmlelement(name "a", 'A & B'))); xpath --------------- {"A & B"} The canonical contents of "a" is "A & B". At first search I've found some rather heated debates about this with bits of name calling; I certainly do not want to get into that and I apologize in advance to those who feel very strongly about this. I've seen one "fix" describe the problem as: ""DESCRIPTION: Submitter invokes following statement: SELECT (XPATH('/*/text()', '<root><</root>'))[1]. He expect (escaped) result "<", but gets "<" """ With respect, this "bug" makes no sense as this produces in fact the right result. The actual value of <root> is "<", it's just escaped when serialized to XML. If <root> were to actually contain "<", it'd be serialized as "&lt;". It should not be possible to be blindly cast to a text type, but explicitly serialized as such. At least the reviewer at: http://www.postgresql.org/message-id/201106291934.23089.rsmog...@softperience.eu agrees, but I don't know what happened with that. The python lxml implementation based on libxml2 seems to also agree: >>> from lxml import etree >>> a = etree.XML("<a/>") >>> a.text = "A & B"; >>> a <Element a at 8019eb470> >>> etree.tostring(a); '<a>A & B</a>' >>> a.text 'A & B' >>> a.xpath('/a') [<Element a at 8019eb470>] >>> a.xpath('/a/text()') ['A & B'] and similarly for a simple test using xsltproc when set to output text. If this really is the intended behaviour or something which can or will not be changed, then it invites double (un)escaping bugs and so on, and I would like to ask how you are supposed to sanely extract the intended text from a node in an XML document without risking double (un)escaping, and whether everybody else is doing it wrong? I get that xpath(..text()) apparently wants to return a type XML, that this is on purpose and that there are certain use cases where you want to treat the result as a type XML which you could not do if it returned an unescaped text value, like here: select xmlelement(name "b", (select (select xpath('/a/text()', (select xmlelement(name "a", 'A & B'))))[1])); xmlelement ------------------ <b>A & B</b> which does not double escape the contents, but where if you cast, it does: select xmlelement(name "b", (select (select xpath('/a/text()', (select xmlelement(name "a", 'A & B'))))[1]::text)); xmlelement ---------------------- <b>A &amp; B</b> (1 row) I personally don't believe this is very helpful. The escaping is only a serialization artifact, a text node does not actually contain any &s and so on. My first thought is then that casting between text and xml should not even be possible, and always an explicit (de)serialization to/from text using a chosen encoding (with a shortcut to a PostgreSQL unicode text type), i.e. treated similarly to the difference between a unicode string and utf-8 encoded representation , and not the equivalent of blindly casting a byte sequence to a string and back and hoping for the best. If xpath(..text()) then absolutely has to return a type XML I would be happy to explicitly serialize it to a type text, if PostgreSQL would forbid me from (accidentally) storing a result in my text field I almost certainly did not intend (the escaped value containing &). Of course my first preference would be that it would return a type TEXT. I appreciate any thoughts and workarounds. I don't really want to add xml unescapes everywhere, that feels like that php method of unescaping a string until it stops changing. If the user did intend the literal text "&" I of course want to preserve that. Many thanks! -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs