Oliver Andrich wrote:
> 2005/6/21, Konstantin Veretennicov <[EMAIL PROTECTED]>:
>
>> It does, as long as headline and caption *can* actually be encoded as
>> macroman. After you decode headline from utf-8 it will be unicode and
>> not all unicode characters can be mapped to macroman:
>>
>> >>> u'\u0160'.encode('utf8')
>> '\xc5\xa0'
>> >>> u'\u0160'.encode('latin2')
>> '\xa9'
>> >>> u'\u0160'.encode('macroman')
>> Traceback (most recent call last):
>>   File "<stdin>", line 1, in ?
>>   File "D:\python\2.4\lib\encodings\mac_roman.py", line 18, in encode
>>     return codecs.charmap_encode(input,errors,encoding_map)
>> UnicodeEncodeError: 'charmap' codec can't encode character u'\u0160' in position 0: character maps to <undefined>
>
> Yes, this and the coercion problems Diez mentioned were the problems I
> faced. Now I have written a little cleanup method, that removes the
> bad characters from the input

By "bad characters", do you mean characters that are in Unicode but not
in MacRoman?

By "removes the bad characters", do you mean "deletes", or do you mean
"substitutes one or more MacRoman characters"?

If all you want to do is torch the bad guys, you don't have to write "a
little cleanup method".

To leave a tombstone for the bad guys:

>>> u'abc\u0160def'.encode('macroman', 'replace')
'abc?def'

To leave no memorial, only a cognitive gap:

>>> u'The Good Soldier \u0160vejk'.encode('macroman', 'ignore')
'The Good Soldier vejk'

Do you *really* need to encode it as MacRoman? Can't the Mac app
understand utf8?

You mentioned cp850 in an earlier post. What would you be feeding
cp850-encoded data that doesn't understand cp1252, and isn't in a museum?

Cheers,
John
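
For the third option raised above ("substitutes one or more MacRoman characters"), here is a rough sketch rather than a recommendation. It assumes Python 3, where str.encode returns bytes instead of the str shown in the sessions above. The function name to_macroman and its fallback parameter are made up for illustration. The idea: encode each character as-is when MacRoman has it; otherwise decompose it with unicodedata.normalize and keep whatever base letters survive; if nothing survives, leave a tombstone.

```python
import unicodedata


def to_macroman(text, fallback="?"):
    """Encode text as MacRoman, degrading unmappable characters.

    Hypothetical helper, not part of any library. `fallback` must
    itself be encodable as MacRoman.
    """
    out = []
    for ch in text:
        try:
            # Characters MacRoman knows about are encoded untouched.
            out.append(ch.encode("mac_roman"))
        except UnicodeEncodeError:
            # Split e.g. U+0160 into 'S' + combining caron, then drop
            # whatever part still has no MacRoman mapping.
            decomposed = unicodedata.normalize("NFKD", ch)
            stripped = decomposed.encode("mac_roman", "ignore")
            # Nothing left at all: use the tombstone instead.
            out.append(stripped or fallback.encode("mac_roman"))
    return b"".join(out)


print(to_macroman("The Good Soldier \u0160vejk"))  # b'The Good Soldier Svejk'
```

Working character by character keeps letters that MacRoman does have (such as e-acute) intact; normalizing the whole string first would strip their accents too.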