On 01.10.09 17:50, Rami Chowdhury wrote:
> On Thu, 01 Oct 2009 08:10:58 -0700, Walter Dörwald
> <wal...@livinglogic.de> wrote:
>
>> On 01.10.09 16:09, Hyuga wrote:
>>> On Sep 30, 3:34 am, gentlestone <tibor.b...@hotmail.com> wrote:
>>>> Why don't work this code on Python 2.6? Or how can I do this job?
>>>>
>>>> [snip _MAP]
>>>>
>>>> def downcode(name):
>>>>     """
>>>>     >>> downcode(u"Žabovitá zmiešaná kaša")
>>>>     u'Zabovita zmiesana kasa'
>>>>     """
>>>>     for key, value in _MAP.iteritems():
>>>>         name = name.replace(key, value)
>>>>     return name
>>>
>>> Though C Python is pretty optimized under the hood for this sort of
>>> single-character replacement, this still seems pretty inefficient
>>> since you're calling replace for every character you want to map. I
>>> think that a better approach might be something like:
>>>
>>> def downcode(name):
>>>     return ''.join(_MAP.get(c, c) for c in name)
>>>
>>> Or using string.translate:
>>>
>>> import string
>>> def downcode(name):
>>>     table = string.maketrans(
>>>         'ÀÁÂÃÄÅ...',
>>>         'AAAAAA...')
>>>     return name.translate(table)
>>
>> Or even simpler:
>>
>> import unicodedata
>>
>> def downcode(name):
>>     return unicodedata.normalize("NFD", name)\
>>         .encode("ascii", "ignore")\
>>         .decode("ascii")
>>
>> Servus,
>> Walter
>
> As I understand it, the "ignore" argument to str.encode *removes* the
> undecodable characters, rather than replacing them with an ASCII
> approximation. Is that correct? If so, wouldn't that rather defeat the
> purpose?

Yes, but normalize() has already split each accented character into its
base character and a combining accent, so only the accent gets removed.

Of course, non-ASCII characters that cannot be decomposed will be removed
completely. To drop only the combining accents and keep everything else,
it would be possible to replace

    .encode("ascii", "ignore").decode("ascii")

with something like this, where name is the already normalized string:

    u"".join(c for c in name if unicodedata.category(c) != "Mn")

Servus,
   Walter
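
For comparison, here is a minimal sketch of the two variants discussed above. It assumes Python 3 rather than the 2.6 from the thread, so the u"" prefixes are dropped. The function names downcode_ascii and downcode_strip_marks are made up for this illustration. Note that the filter keeps every character whose category is not "Mn" (nonspacing mark), which is what removes the accents.

```python
import unicodedata


def downcode_ascii(name):
    # Decompose accented characters, then drop everything that is not ASCII.
    decomposed = unicodedata.normalize("NFD", name)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def downcode_strip_marks(name):
    # Decompose accented characters, then drop only the combining accents
    # (category "Mn"); other non-ASCII characters are kept.
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")


print(downcode_ascii("Žabovitá zmiešaná kaša"))        # Zabovita zmiesana kasa
print(downcode_strip_marks("Žabovitá zmiešaná kaša"))  # Zabovita zmiesana kasa

# "Ł" has no decomposition, so the two variants differ here.
print(downcode_ascii("Łódź"))        # odz
print(downcode_strip_marks("Łódź"))  # Łodz
```

The second variant never loses a base letter, but its result is not guaranteed to be pure ASCII, so which one fits depends on what the output is used for.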