Tim Arnold wrote:
> Hi, I'm beginning to understand the encode/decode string methods, but I'd
> like confirmation that I'm still thinking in the right direction:
>
> I have a file of latin1 encoded text. Let's say I put one line of that file
> into a string variable 'tocline', as follows:
> tocline = 'Ficha Datos de p\xe9rdida AND acci\xf3n'
>
> import codecs
> tocFile = codecs.open('mytoc.htm','wb',encoding='utf8',errors='replace')
> tocline = tocline.decode('latin1','replace')
> tocFile.write(tocline)
> tocFile.close()
>
> What I think is that tocFile is wrapped to ensure that anything written to
> it is in utf8
> I decode the latin1 string into python's internal unicode encoding and that
> gets written out as utf8.
>
> Questions:
> what exactly is the tocline when it's read in with that \xe9 and \xf3 in the
> string? A latin1 encoded string?

Yes. It is a plain byte string that happens to contain bytes which are "correct" when interpreted as latin1.

> Is my method the right way to write such a line out to a file with utf8
> encoding?

Yes.

> If I read in the latin1 file using
> codecs.open(filename,encoding='latin1') and write out the utf8 file by
> opening with
> codecs.open(othername,encoding='utf8'), would I no longer have a problem --
> I could just read in latin1 and write out utf8 with no more worries about
> encoding?

As long as you don't mix in byte strings and only use unicode objects, you should be fine, yes.

Diez
--
http://mail.python.org/mailman/listinfo/python-list
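
The read-as-latin1, write-as-utf8 approach asked about above could be sketched roughly as follows. This is only an illustration under some assumptions: the helper name latin1_to_utf8 and the temporary file names are made up for the example, and the sample line is the one from the question. It uses a b'...' literal so that it behaves the same whether or not plain string literals are byte strings.

```python
import codecs
import os
import tempfile

# The sample line from the question, as a byte string in latin1.
raw = b'Ficha Datos de p\xe9rdida AND acci\xf3n'

# Decoding turns the bytes into a unicode object.
text = raw.decode('latin1')


def latin1_to_utf8(src_path, dst_path):
    """Copy a latin1 text file to a utf8 text file (hypothetical helper)."""
    with codecs.open(src_path, 'r', encoding='latin1') as src:
        with codecs.open(dst_path, 'w', encoding='utf8') as dst:
            for line in src:
                # 'line' is already unicode; the wrapper encodes it on write.
                dst.write(line)


# Small demonstration using throwaway files (names are illustrative only).
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, 'toc_latin1.htm')
dst_path = os.path.join(workdir, 'mytoc.htm')

with open(src_path, 'wb') as f:
    f.write(raw + b'\n')

latin1_to_utf8(src_path, dst_path)

with open(dst_path, 'rb') as f:
    utf8_bytes = f.read()
```

Inside the helper nothing is ever a byte string: the reader hands back unicode and the writer accepts unicode, so no explicit decode() or encode() call is needed.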