metaperl wrote:

> There is no end to the number of frantic pleas for help with
> characters in the realm beyond ASCII.

And the answer is "first decode to unicode, then modify" in nine out of ten
cases.

> However, in searching thru them, I do not see a workable approach to
> changing them into other things.
>
> I am dealing with a file and in my Emacs editor, I see
> "MASSACHUSETTS-AMHERST" ... in other words, there is a dash between
> MASSACHUSETTS and AMHERST.
>
> However, if I do a grep for the text the shell returns this:
>
> MASSACHUSETTS–AMHERST
>
> and od -tc returns this:
>
> 0000540 O F M A S S A C H U S E T T
> 0000560 S 342 200 223 A M H E R S T ; U N I
>
> So, the conclusion is the "dash" is actually 3 octal characters. My
> goal is to take those 3 octal characters and convert them to an ascii
> dash. Any idea how I might write such a filter? The closest I have got
> it:
>
> unicodedata.normalize('NFKD', s).encode('ASCII', 'replace')
>
> but that puts a question mark there.

No idea where the character references come from, but the dump suggests that
your text is in UTF-8.

>>> "MASSACHUSETTS\342\200\223AMHERST".decode("utf8")
u'MASSACHUSETTS\u2013AMHERST'
>>> "MASSACHUSETTS\342\200\223AMHERST".decode("utf8").replace(u"\u2013", "-")
u'MASSACHUSETTS-AMHERST'

u"\u2013" is indeed a dash, by the way:

>>> import unicodedata
>>> unicodedata.name(u"\u2013")
'EN DASH'

Peter
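
The session above is Python 2. On Python 3 the same decode-then-replace idea might look roughly like the sketch below. The names to_ascii_dashes and DASH_MAP are made up for illustration, and the input is assumed to be UTF-8, as the od dump suggests. The NFKD attempt produced a question mark because EN DASH has no compatibility decomposition, so normalization leaves it untouched and the ASCII encoder then has to replace it.

```python
# Illustrative sketch, not from the original thread: the names are hypothetical.
# Maps the code point of EN DASH (U+2013) to an ASCII hyphen-minus.
# Other dash-like code points could be added here if the data needs them.
DASH_MAP = {0x2013: "-"}


def to_ascii_dashes(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode the raw bytes first, then replace EN DASH in the text."""
    text = raw.decode(encoding)      # bytes -> str (assumes UTF-8 input)
    return text.translate(DASH_MAP)  # swap U+2013 for "-"


if __name__ == "__main__":
    # The three bytes 342 200 223 (octal) from the od dump are the
    # UTF-8 encoding of U+2013.
    raw = b"MASSACHUSETTS\342\200\223AMHERST"
    print(to_ascii_dashes(raw))
```

To filter a whole file, the same function could be applied to the bytes read with open(path, "rb"), or the file could be opened in text mode with encoding="utf-8" and str.translate called on each line.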