On Wed, 17 Nov 2010 20:21:06 -0800, Sorin Schwimmer wrote:

> Hi All,
>
> I have to eliminate diacritics in a fairly large file.

What's "fairly large"? Large to you is probably not large to your computer. Anything less than a few dozen megabytes is small enough to be read entirely into memory.

> Inspired by http://code.activestate.com/recipes/81330/, I came up with
> the following code:

If all you are doing is replacing single characters, then there's no need for the 80lb sledgehammer of regular expressions when all you need is a delicate tack hammer. Instead of this:

* read the file as bytes
* search for pairs of bytes like chr(195)+chr(130) using a regex
* replace them with single bytes like 'A'

do this:

* read the file as Unicode
* search for characters like Â
* replace them with single characters like A

using unicode.translate() (or str.translate() in Python 3.x).

The only gotcha is that you need to know (or guess) the encoding to read the file correctly.

-- 
Steven
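
For illustration, the second approach might look roughly like this in Python 3. This is only a sketch: it assumes the file is UTF-8, the mapping covers just a few sample characters, and the names DIACRITICS, strip_diacritics and convert_file are made up for this example rather than taken from the code under discussion.

```python
# Illustrative mapping only: extend it with the characters your file contains.
DIACRITICS = {
    "\u00c2": "A",  # A with circumflex (UTF-8 bytes 195, 130)
    "\u00e2": "a",  # a with circumflex
    "\u00ce": "I",  # I with circumflex
    "\u00ee": "i",  # i with circumflex
}

# str.maketrans() turns the dict into a table keyed by code point.
TABLE = str.maketrans(DIACRITICS)


def strip_diacritics(text):
    """Replace every mapped character; leave everything else alone."""
    return text.translate(TABLE)


def convert_file(src, dst, encoding="utf-8"):
    """Read src as text, replace mapped characters, write the result to dst.

    The encoding is an assumption: pass the real one if it is not UTF-8.
    """
    with open(src, encoding=encoding) as infile:
        text = infile.read()  # fine for files up to a few dozen megabytes
    with open(dst, "w", encoding=encoding) as outfile:
        outfile.write(strip_diacritics(text))
```

Calling convert_file("in.txt", "out.txt") would then rewrite the whole file in one pass (both file names are placeholders). In Python 2 the same idea applies with unicode.translate() and a dict keyed by ord() values.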