On Oct 4, 7:41 am, harrelson <[EMAIL PROTECTED]> wrote:
> I have a large amount of data in a postgresql database with the
> encoding of SQL_ASCII. Most recent data is UTF-8 but data from
> several years ago could be of some unknown other data type. Being
> honest with myself, I am not even sure that the most recent data is
> always UTF-8-- data is entered on web forms and I wouldn't be
> surprised if data of other encodings is slipping in.
>
> Up to the point I have just ignored the problem-- on the web side of
> things everything works good enough. But now I am required to stuff
> this data into xml datasets and I am, of course, having problems. My
> preference would be to force the data into UTF-8 even if it is
> ultimately an incorrect encoding translation but this isn't working.
> The below code represents my most recent problem:
>
> import xml.dom.minidom
> print chr(3).encode('utf-8')
> dom = xml.dom.minidom.parseString( "<test>%s</test>" %
> chr(3).encode('utf-8') )
>
> chr(3) is the ascii character for "end of line". I would think that
> trying to encode this to utf-8 would fail but it doesn't-- I don't get
> a failure till we get into xml land and the parser complains. My
> question is why doesn't encode() blow up? It seems to me that
> encode() shouldn't output anything that parseString() can't handle.

The encode method is doing its job, which is to encode ANY and EVERY unicode character as utf-8, so that it can be transported reliably over an 8-bit-wide channel. encode is *not* supposed to guess what you are going to do with the output.

Perhaps instead of "forcing the data into utf-8", you should be thinking about what is actually in your data. What is the context that chr(3) appears in? Perhaps when you get around to print repr(some_data), you might see things like "\x03harlie \x03haplin" -- it's a common enough keyboarding error to hit the Ctrl key instead of the Shift key, and unfortunately a common enough design error for there to be no checking at all.

BTW, there's no forcing involved -- chr(3) is *already* utf-8: it is a 7-bit ASCII character, and ASCII is a subset of utf-8.

HTH,
John
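
To make the point above concrete, here is a rough sketch, not a recommendation. It is written in Python 3 syntax rather than the Python 2 of the quoted code, and the names strip_xml_illegal and parses are made up for illustration. It relies on the fact that XML 1.0 does not allow control characters below U+0020 other than tab, LF and CR, which is why the parser complains even though the utf-8 encoding itself is perfectly valid.

```python
import re
import xml.dom.minidom
from xml.parsers.expat import ExpatError

# C0 control characters that XML 1.0 forbids (everything below U+0020
# except tab, LF and CR). Hypothetical helper pattern, for illustration only.
_XML_ILLEGAL_CONTROLS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strip_xml_illegal(text):
    """Return text with the forbidden control characters removed."""
    return _XML_ILLEGAL_CONTROLS.sub("", text)


def parses(xml_bytes):
    """Return True if minidom accepts the document, False otherwise."""
    try:
        xml.dom.minidom.parseString(xml_bytes)
        return True
    except ExpatError:
        return False


raw = "\x03harlie \x03haplin"

# Encoding succeeds: U+0003 is a legal character with a one-byte utf-8 form.
encoded = raw.encode("utf-8")

# The XML parser is the layer that rejects it.
bad_doc = b"<test>" + encoded + b"</test>"
good_doc = b"<test>" + strip_xml_illegal(raw).encode("utf-8") + b"</test>"

print(repr(encoded))
print(parses(bad_doc), parses(good_doc))
```

Silently dropping the characters loses the information that something was mistyped, so inspecting the data first (as suggested above) and deciding per case is probably the better policy; the stripping step only shows where such a check would sit.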