On 24/03/2006 8:11 PM, Duncan Booth wrote:
> Peter Otten wrote:
>
>>> You can replace ALL of this upshifting and accent removal in one blow
>>> by using the string translate() method with a suitable table.
>>
>> Only if you convert to unicode first or if your data maintains 1 byte
>> == 1 character, in particular it is not UTF-8.
>
> There's a nice little codec from Skip Montaro for removing accents from

For the benefit of those who may read only this far, it is NOT nice.

> latin-1 encoded strings. It also has an error handler so you can convert
> from unicode to ascii and strip all the accents as you do so:
>
> http://orca.mojam.com/~skip/python/latscii.py
>
> >>> import latscii
> >>> import htmlentitydefs
> >>> print u'\u00c9'.encode('ascii','replacelatscii')
> E
>
> So Bussiere could replace a large chunk of his code with:

Could, but definitely shouldn't.

> ligneA = ligneA.decode(INPUTENCODING).encode('ascii', 'replacelatscii')
> ligneA = ligneA.upper()
>
> INPUTENCODING is 'utf8' unless (one possible explanation for his problem)
> his files are actually in some different encoding.
>
> Unfortunately, just as I finished writing this I discovered that the
> latscii module isn't as robust as I thought, it blows up on consecutive
> accented characters.
>
> :(

Some of the transformations are a little unfortunate :-(

    0x00d0: ord('D'), # Ð
    0x00f0: ord('o'), # ð

Icelandic capital eth becomes D, OK; but the small letter becomes o!!!

The Icelandic thorn letters become P & p (based on physical appearance),
when they should become Th and th.

The German letter Eszett (00DF) becomes B (appearance) when it should be ss.

Creating alphabetics out of punctuation is scarcely something that
bussiere should be interested in:

    0x00a2: ord('c'), # ¢
    0x00a4: ord('o'), # ¤
    0x00a5: ord('Y'), # ¥
    0x00a7: ord('S'), # §
    0x00a9: ord('c'), # ©
    0x00ae: ord('R'), # ®
    0x00b6: ord('P'), # ¶
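
For comparison, here is a rough sketch of accent stripping that avoids the mappings criticised above. It is not the latscii code and not anything posted in this thread. It assumes Python 3 (where str is already Unicode) and the standard unicodedata module. The names SPECIAL and strip_accents are made up for illustration, and mapping small eth to 'd' is my own assumption, by analogy with capital eth becoming D.

```python
import unicodedata

# Hypothetical table for letters that have no canonical decomposition.
# ss / Th / th follow the transliterations suggested in the post;
# 'd' for small eth is an assumption made here.
SPECIAL = {
    "\u00df": "ss",  # ß  LATIN SMALL LETTER SHARP S
    "\u00de": "Th",  # Þ  LATIN CAPITAL LETTER THORN
    "\u00fe": "th",  # þ  LATIN SMALL LETTER THORN
    "\u00d0": "D",   # Ð  LATIN CAPITAL LETTER ETH
    "\u00f0": "d",   # ð  LATIN SMALL LETTER ETH (assumed)
}


def strip_accents(text):
    """Return text with accents removed from letters.

    Characters in SPECIAL are transliterated explicitly. Everything else
    is canonically decomposed (NFD) and its combining marks are dropped.
    Symbols such as © or ¢ have no decomposition and pass through
    unchanged rather than being turned into letters.
    """
    out = []
    for ch in text:
        if ch in SPECIAL:
            out.append(SPECIAL[ch])
            continue
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed
                           if not unicodedata.combining(c)))
    return "".join(out)


if __name__ == "__main__":
    print(strip_accents("\u00e9t\u00e9 stra\u00dfe").upper())
```

Bytes read from a file would have to be decoded first, e.g. data.decode('utf-8'), which is Peter's point about converting to unicode before translating. Letters such as Æ and Ø have no canonical decomposition either, so this sketch leaves them alone; whether to add them to the table depends on what the output is for.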