Fredrik Lundh wrote:
> thebjorn wrote:
>
> > I've got a database (ms sqlserver) that's (way) out of my control,
> > where someone has stored utf-8 encoded Unicode data in regular varchar
> > fields, so that e.g. the string 'Blåbærsyltetøy' is in the database
> > as 'Bl\xc3\xa5b\xc3\xa6rsyltet\xc3\xb8y' :-/ ..
>
> first, check if you can get your database adapter to understand that the
> database contains UTF-8 and not ISO-8859-1.

That would be the way to go; however, it looks like they've managed to get
Latin-1 data in exactly two columns in the entire database (this is a
commercial product of course, so there's no way for us to fix things).

And just to make things more interesting, I think I'm running into an ADO
bug where capital letters (outside the U+0000 to U+007F range) are
returning strange values:

  >>> c.execute('create table utf8 (f1 varchar(15))')
  >>> u'ÆØÅÉ'.encode('utf-8')
  '\xc3\x86\xc3\x98\xc3\x85\xc3\x89'
  >>> x = _
  >>> c.execute('insert into utf8 (f1) values (?)', (x,))
  >>> c.execute('select * from utf8')
  >>> c.fetchall()
  ((u'\xc3\u2020\xc3\u02dc\xc3\u2026\xc3\u2030',),)
  >>>

I haven't tested this through C[#/++] to verify that it's an ADO issue,
but it seems unlikely that MS would view this as anything but incorrect
usage no matter where the issue is... Anyway, sorry for venting :-)

> if that's not possible, you can roundtrip via ISO-8859-1 yourself:
>
> >>> u = u'Bl\xc3\xa5b\xc3\xa6rsyltet\xc3\xb8y' ...
> >>> print u.encode("iso-8859-1").decode("utf-8")
> Blåbærsyltetøy

That's very nice!

-- bjorn
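
A possible reading of the capital-letter output, offered as a sketch rather than a diagnosis of ADO: the code points u'\u2020', u'\u02dc', u'\u2026' and u'\u2030' are exactly what Windows-1252 (cp1252) assigns to the bytes 0x86, 0x98, 0x85 and 0x89, so the adapter may be decoding the varchar bytes as cp1252 rather than ISO-8859-1. Under that assumption the same roundtrip works with a different codec. The example below is Python 3 (the session in this thread is Python 2), and `undo_mojibake` is a hypothetical helper name, not part of any library.

```python
# Python 3 sketch. Assumption: the adapter decoded UTF-8 bytes as cp1252.
def undo_mojibake(text, wrong_encoding="cp1252"):
    """Re-encode text with the codec it was wrongly decoded with,
    then decode the resulting bytes as UTF-8."""
    return text.encode(wrong_encoding).decode("utf-8")

# Lowercase case from the thread: all bytes are below 0x80 or at/above 0xA0,
# a range where ISO-8859-1 and cp1252 agree, so either codec works here.
stored = "Bl\xc3\xa5b\xc3\xa6rsyltet\xc3\xb8y"
print(undo_mojibake(stored, "iso-8859-1"))

# Capital-letter case: the value returned by c.fetchall() in the session.
# ISO-8859-1 cannot encode u'\u2020' and friends, but cp1252 can.
fetched = "\xc3\u2020\xc3\u02dc\xc3\u2026\xc3\u2030"
print(undo_mojibake(fetched, "cp1252"))
```

One caveat: Python's cp1252 codec leaves the bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D undefined, so characters whose second UTF-8 byte is one of those (for example 'Á', which is '\xc3\x81') are not covered by this sketch and would need separate handling.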