"Diez B. Roggisch" <[EMAIL PROTECTED]> wrote in message
news:[EMAIL PROTECTED]
> Tim Arnold wrote:
>> Hi, I'm beginning to understand the encode/decode string methods, but I'd
>> like confirmation that I'm still thinking in the right direction:
>>
>> I have a file of latin1 encoded text. Let's say I put one line of that
>> file into a string variable 'tocline', as follows:
>> tocline = 'Ficha Datos de p\xe9rdida AND acci\xf3n'
>>
>> import codecs
>> tocFile = codecs.open('mytoc.htm','wb',encoding='utf8',errors='replace')
>> tocline = tocline.decode('latin1','replace')
>> tocFile.write(tocline)
>> tocFile.close()
>>
>> What I think is that tocFile is wrapped to ensure that anything written
>> to it is in utf8.
>> I decode the latin1 string into python's internal unicode encoding and
>> that gets written out as utf8.
>>
>> Questions:
>> what exactly is the tocline when it's read in with that \xe9 and \xf3 in
>> the string? A latin1 encoded string?
>
> Yes. A simple, pure byte-string, that happens to contain bytes which under
> the latin1-encoding are "correct".
>
>> Is my method the right way to write such a line out to a file with utf8
>> encoding?
>
> Yes.
>
>> If I read in the latin1 file using
>> codecs.open(filename,encoding='latin1') and write out the utf8 file by
>> opening with
>> codecs.open(othername,encoding='utf8'), would I no longer have a
>> problem -- I could just read in latin1 and write out utf8 with no more
>> worries about encoding?
>
> As long as you don't mix bytestrings and only use unicode-objects, you
> should be fine, yes.
>
> Diez

Wow, I was thinking correctly about encoding! Time for a beer!
Diez, thanks very much for confirming my thoughts.

--Tim Arnold
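
For anyone finding this thread later, here is a minimal sketch of the read-latin1, write-utf8 round trip discussed above. It is only an illustration under a few assumptions: it uses io.open rather than the codecs.open from the thread (both accept an encoding argument), the file names and the temporary directory are made up for the example, and newline='' is passed so that line endings are left untouched.

```python
import io
import os
import tempfile

# The line from the thread, as raw Latin-1 bytes (not yet text).
raw = b'Ficha Datos de p\xe9rdida AND acci\xf3n'

# Decode once at the boundary: byte-string -> unicode text.
text = raw.decode('latin1')

# Hypothetical file names in a scratch directory, for illustration only.
workdir = tempfile.mkdtemp()
latin1_path = os.path.join(workdir, 'toc_latin1.htm')
utf8_path = os.path.join(workdir, 'toc_utf8.htm')

# Create a Latin-1 input file to stand in for the original one.
with io.open(latin1_path, 'wb') as f:
    f.write(raw + b'\n')

# Read as Latin-1, write as UTF-8. Only unicode text passes between
# the two file objects; each one handles its own encoding.
with io.open(latin1_path, 'r', encoding='latin1', newline='') as src:
    with io.open(utf8_path, 'w', encoding='utf8', newline='') as dst:
        for line in src:
            dst.write(line)

# Inspect the bytes that actually landed on disk.
with io.open(utf8_path, 'rb') as f:
    utf8_bytes = f.read()
```

Each accented character takes one byte in the Latin-1 file and two bytes in the UTF-8 file, so utf8_bytes ends up two bytes longer than the input. As Diez says, this stays trouble-free as long as no raw byte-strings are mixed in after the decode.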