harrelson wrote:
I have a large amount of data in a PostgreSQL database with the
encoding SQL_ASCII.  Most of the recent data is UTF-8, but data from
several years ago could be in some other, unknown encoding.  Being
honest with myself, I am not even sure the recent data is always
UTF-8: it is entered on web forms, and I wouldn't be surprised if
data in other encodings has been slipping in.

First, I would highly recommend cleaning up the database and getting
everything into UTF-8, then re-initdb'ing the cluster with a correct
UTF-8 locale and database encoding "unicode", and then cleanly
restoring the data.  That way the database can make sure further
inserts are in the correct encoding, and you only have to do the
cleanup once - not every time your XML interface gets used.
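For the cleanup step, something along these lines might do (a minimal
sketch only; the file names and the latin-1 fallback are assumptions,
pick whatever legacy encoding your old web forms actually used):

def to_utf8(s):
    # Already valid UTF-8?  Then keep the bytes as they are.
    try:
        s.decode('utf-8')
        return s
    except UnicodeDecodeError:
        # Assumed fallback: latin-1 accepts any byte sequence, so
        # nothing is lost, but substitute the real legacy encoding
        # if you know it.
        return s.decode('latin-1').encode('utf-8')

out = open('newdata.txt', 'w')
for line in open('olddata.txt'):
    out.write(to_utf8(line))
out.close()

Run that over a plain-text dump of the suspect columns before you
restore into the freshly initdb'ed cluster.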

...

import xml.dom.minidom

print chr(3).encode('utf-8')                  # works, prints the ETX byte (0x03)
dom = xml.dom.minidom.parseString(            # raises xml.parsers.expat.ExpatError
    "<test>%s</test>" % chr(3).encode('utf-8'))

chr(3) is the ASCII "end of text" (ETX) control character.  I would
have thought that trying to encode this to UTF-8 would fail, but it
doesn't; I don't get
Nope, ASCII (ord(x) < 128) is a subset of UTF-8, so code point 3 is
indeed valid UTF-8.
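A quick throwaway check (not from the original post) makes this
visible: every ASCII code point encodes to the identical single byte
in UTF-8, so encode() has nothing to object to.

for i in range(128):
    assert unichr(i).encode('utf-8') == chr(i)
print chr(3).encode('utf-8') == '\x03'   # prints True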

a failure until we get into XML land and the parser complains.  My
question is: why doesn't encode() blow up?  It seems to me that
encode() shouldn't output anything that parseString() can't handle.

It just can't be put literally into XML - that is a separate step.
You basically need to escape it or strip it, or have your XML library
do so for you.  Note that XML 1.0 doesn't allow most control
characters at all, not even as character references, so chr(3) has to
be removed or replaced rather than merely escaped.
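If you just want the parser to stop complaining, one approach (my own
sketch, not necessarily what the article below suggests; the name
xml_safe and the decision to drop rather than replace the characters
are illustrative assumptions) is to strip everything XML 1.0 doesn't
allow before building the document:

import re
import xml.dom.minidom

# XML 1.0 permits only #x9, #xA, #xD and characters from #x20 upwards
# (minus surrogates and #xFFFE/#xFFFF), so the offending C0 control
# characters are simply dropped here.
illegal = re.compile(u'[\x00-\x08\x0b\x0c\x0e-\x1f]')

def xml_safe(u):
    return illegal.sub(u'', u)

text = xml_safe(chr(3).decode('utf-8'))
dom = xml.dom.minidom.parseString(
    "<test>%s</test>" % text.encode('utf-8'))   # parses fine now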

A little googling turns up this article, which might be helpful:

http://www.xml.com/pub/a/2002/11/13/py-xml.html

Regards
Tino
