Nick Matzke <matzke <at> berkeley.edu> writes:

> Looks like this was a solution:
>
> 1. Use this guy's unescape function to convert from HTML/XML Entities to
> unicode
> http://effbot.org/zone/re-sub.htm#unescape-html

Looks like you didn't notice "this guy"'s unaccent.py :-)
http://effbot.org/zone/unicode-convert.htm

[Aside: Has anyone sighted the effbot recently? He's been very quiet.]

> 2. Take the unicode and convert to approximate plain ASCII matches with
> unicodedata (after import unicodedata)
>
> ascii_content2 = unescape(line)
>
> ascii_content = unicodedata.normalize('NFKD',
> unicode(ascii_content2)).encode('ascii','ignore')

The normalize hack gets you only so far. Many Latin-based characters are not
decomposable. Look for the thread in this newsgroup with subject "convert
unicode characters to visibly similar ascii characters" around 2008-07-01 or
google("hefferon unicode2ascii")

Alternative: If you told us which platform you are running on, people
familiar with that platform could help you set up your terminal to display
non-ASCII characters correctly.

HTH,
John
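
A minimal sketch of why the normalize approach falls short, written in Python 3 syntax (the quoted code is Python 2). The assumptions: html.unescape from the standard library stands in for the effbot unescape function, and FALLBACK is a made-up, deliberately tiny table for illustration only. It is not the table from unaccent.py, and a real one would need many more entries.

```python
import html
import unicodedata

# Hypothetical sample table: characters with no Unicode decomposition,
# which NFKD + encode('ascii', 'ignore') would silently drop.
FALLBACK = {
    "\u00d8": "O",   # O with stroke
    "\u00f8": "o",   # o with stroke
    "\u0141": "L",   # L with stroke
    "\u0142": "l",   # l with stroke
    "\u0111": "d",   # d with stroke
    "\u00df": "ss",  # sharp s
    "\u00c6": "AE",  # AE ligature
    "\u00e6": "ae",  # ae ligature
}

def normalize_only(text):
    """The approach from the quoted message: decompose, then drop non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

def to_ascii(text):
    """Consult the fallback table first, then fall back to NFKD stripping."""
    out = []
    for ch in text:
        if ch in FALLBACK:
            out.append(FALLBACK[ch])
        else:
            out.append(normalize_only(ch))
    return "".join(out)

if __name__ == "__main__":
    line = html.unescape("&Oslash;resund caf&eacute;")
    print(normalize_only(line))  # the O with stroke is lost entirely
    print(to_ascii(line))        # the table supplies a plain "O"
```

Accented letters such as e-acute decompose into a base letter plus a combining mark, so normalize alone handles them. The stroked and ligature letters have no decomposition, so without a table entry they simply disappear from the output.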