harrelson wrote:
I have a large amount of data in a PostgreSQL database with the encoding SQL_ASCII. Most recent data is UTF-8, but data from several years ago could be in some other, unknown encoding. Being honest with myself, I am not even sure that the most recent data is always UTF-8: data is entered on web forms, and I wouldn't be surprised if data in other encodings is slipping in.
First I would highly recommend cleaning up the database and getting everything into UTF-8, then re-initdb'ing the cluster with a correct UTF-8 locale and the database encoding "unicode", then cleanly restoring the data. This way the database can make sure further inserts are in the correct encoding, and you only have to do the cleanup once, not every time your XML interface gets used. ...
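Roughly, the cleanup pass could look like the sketch below (untested; I'm assuming psycopg2 and making up a table "messages" with a text column "body", so adjust to your schema): anything that already decodes as UTF-8 is left alone, and the legacy rows are guessed to be latin-1 and re-encoded.

import psycopg2

def to_utf8(raw):
    # raw comes back as a plain byte string from the SQL_ASCII database
    try:
        raw.decode('utf-8')
        return raw                      # already valid UTF-8, keep as-is
    except UnicodeDecodeError:
        # guess latin-1 for the old rows; change if you know the real encoding
        return raw.decode('latin-1').encode('utf-8')

conn = psycopg2.connect("dbname=mydb")            # made-up connection string
cur = conn.cursor()
cur.execute("SELECT id, body FROM messages")      # made-up table/column names
for row_id, body in cur.fetchall():
    fixed = to_utf8(body)
    if fixed != body:
        cur.execute("UPDATE messages SET body = %s WHERE id = %s",
                    (fixed, row_id))
conn.commit()

After a pass like that, the dump should restore cleanly into the new UTF-8 cluster.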
import xml.dom.minidom

# encoding succeeds...
print chr(3).encode('utf-8')
# ...but parsing the result blows up
dom = xml.dom.minidom.parseString(
    "<test>%s</test>" % chr(3).encode('utf-8'))
chr(3) is the ASCII control character ETX ("end of text"). I would think that trying to encode this to UTF-8 would fail, but it doesn't; I don't get
Nope, ASCII (ord(x) < 128) is contained in UTF-8, so 3 is indeed a valid code point in UTF-8.
a failure until we get into XML land and the parser complains. My question is: why doesn't encode() blow up? It seems to me that encode() shouldn't output anything that parseString() can't handle.
It just can't be put into XML literally; that is another step. You basically need to encode it as an XML character reference, or have your XML library do so. A little googling turns up this article, which might be helpful: http://www.xml.com/pub/a/2002/11/13/py-xml.html
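As a rough, untested sketch of that extra step (the names are my own): control characters like chr(3) are, as far as I know, not allowed in XML 1.0 at all, not even as character references, so they have to be dropped or replaced before you hand the text to minidom.

import re
import xml.dom.minidom

# code points XML 1.0 forbids: most C0 controls, surrogates, U+FFFE/U+FFFF
_xml_illegal = re.compile(
    u'[\x00-\x08\x0b\x0c\x0e-\x1f\ud800-\udfff\ufffe\uffff]')

def xml_safe(raw, encoding='utf-8'):
    # decode first so we filter code points, not raw bytes
    if isinstance(raw, str):
        raw = raw.decode(encoding)
    return _xml_illegal.sub(u'', raw)

# chr(3) is silently dropped, so minidom is happy now
dom = xml.dom.minidom.parseString(
    "<test>%s</test>" % xml_safe(chr(3)).encode('utf-8'))

Regards
Tino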