Fredrik Lundh wrote:
> John Machin wrote:
>
> > Another point: there are many non-latin1 characters that could be
> > mapped to ASCII. For example:
> >     u"\u0141ukasziewicz".translate(unaccented_map())
> > doesn't work unless an entry is added to the no-decomposition table:
> >     0x0141: u"L", # LATIN CAPITAL LETTER L WITH STROKE
> >
> > It looks like generating extra entries like that could be done, with
> > the aid of unicodedata.name():
> >
> >     LATIN CAPITAL LETTER X WITH blahblah -> "X"
> >     LATIN SMALL LETTER X WITH blahblah -> "X".lower()
> >
> > This would require a fair bit of care -- obviously there are special
> > cases like LATIN CAPITAL LETTER O WITH STROKE. Eyeballing by regional
> > experts is probably required.
>
> see the comments over at
>
> http://effbot.org/zone/unicode-convert.htm

Don't rush me, I was getting to that next :-)

> for an extended table, eyeballed by a regional expert (and since he
> makes the same point about OE vs Oe as you do, I'll probably have to
> change the code ;-)

Slightly extended. My point is that there is a large number of
LATIN (CAPITAL|SMALL) LETTER X WITH twiddly-bits that don't have a
decomposition; the table entries could be generated automatically.

As well as regional experts, Google can be handy: googling for Thord,
Thordh, Thordsson and Thordhsson and noting the number of hits for each
tends to indicate that you and I are right about the treatment of "eth";
Marcin's "dh" might better indicate how it's pronounced, but "d" is
AFAICT the standard transcription.

Cheers,
John
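
A minimal sketch of the automatic generation idea discussed above, not code from the thread. It scans a range of code points, skips any character that already has a decomposition, and uses unicodedata.name() to pull the base letter out of names of the form LATIN (CAPITAL|SMALL) LETTER X WITH something. It is written for Python 3 (the u"..." literals in the thread are Python 2). The function name generate_candidates, the regex, and the default range U+00C0..U+024F are assumptions made for illustration. The result is a list of candidates to be eyeballed, not a finished table: LATIN CAPITAL LETTER O WITH STROKE, for instance, comes out as plain "O".

```python
import re
import unicodedata

# Matches names such as "LATIN CAPITAL LETTER L WITH STROKE" and captures
# the case word and the single base letter. (Illustrative pattern.)
_NAME_RE = re.compile(r"^LATIN (CAPITAL|SMALL) LETTER ([A-Z]) WITH ")


def generate_candidates(start=0x00C0, stop=0x0250):
    """Return {code point: base letter} for letters without a decomposition.

    Hypothetical helper: the range defaults are an assumption covering
    Latin-1 Supplement, Latin Extended-A and Latin Extended-B.
    """
    table = {}
    for cp in range(start, stop):
        ch = chr(cp)
        if unicodedata.decomposition(ch):
            continue  # the decomposition-based mapping already handles these
        match = _NAME_RE.match(unicodedata.name(ch, ""))
        if match is None:
            continue  # unnamed, or not of the "LETTER X WITH ..." form
        case, letter = match.groups()
        table[cp] = letter if case == "CAPITAL" else letter.lower()
    return table


candidates = generate_candidates()
print("\u0141ukasziewicz".translate(candidates))  # prints Lukasziewicz
```

Letters whose names have no WITH part, such as LATIN CAPITAL LETTER ETH, do not match the pattern, so entries like "eth" -> "d" would still have to be added by hand.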