Re: String multi-replace

Steven D'Aprano Wed, 17 Nov 2010 21:18:08 -0800

On Wed, 17 Nov 2010 20:21:06 -0800, Sorin Schwimmer wrote:

> Hi All,
> 
> I have to eliminate diacritics in a fairly large file.


What's "fairly large"? Large to you is probably not large to your 
computer. Anything less than a few dozen megabytes is small enough to be 
read entirely into memory.



> Inspired by http://code.activestate.com/recipes/81330/, I came up with
> the following code:

If all you are doing is replacing single characters, then there's no need 
for the 80lb sledgehammer of regular expressions when all you need is a 
delicate tack hammer. Instead of this:

* read the file as bytes
* search for pairs of bytes like chr(195)+chr(130) using a regex
* replace them with single bytes like 'A'

do this:

* read the file as Unicode text
* search for characters like Â
* replace them with single characters like A using unicode.translate()

(or str.translate() in Python 3.x)
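
A minimal sketch of the translate() approach, assuming the text has 
already been decoded. The four entries in TABLE and the name 
strip_diacritics are placeholders of mine, not taken from your code; 
fill the table with whatever characters actually occur in your file.

```python
# Assumed mapping for illustration: code point -> replacement text.
# A dict keyed by ordinals is accepted by str.translate() in Python 3
# (and by unicode.translate() in Python 2).
TABLE = {
    0x00C2: u"A",  # capital A with circumflex
    0x00E2: u"a",  # small a with circumflex
    0x00CE: u"I",  # capital I with circumflex
    0x00EE: u"i",  # small i with circumflex
}

def strip_diacritics(text):
    """Replace each mapped character; leave everything else unchanged."""
    return text.translate(TABLE)
```

Characters that are not in the table pass through untouched, so an 
incomplete table fails quietly rather than loudly.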


The only gotcha is that you need to know (or guess) the encoding to read 
the file correctly.
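
For what it's worth, the byte pair chr(195)+chr(130) is how UTF-8 
encodes Â, so UTF-8 is a reasonable first guess here. A rough sketch 
of the whole round trip under that assumption (the function names and 
the `table` argument are illustrative; `table` is any mapping that 
translate() accepts):

```python
import io

def read_text(path, encoding="utf-8"):
    # Decode while reading. A wrong encoding guess may raise
    # UnicodeDecodeError, or may silently yield the wrong characters.
    with io.open(path, encoding=encoding) as f:
        return f.read()

def remove_diacritics_from_file(src, dst, table, encoding="utf-8"):
    # Read the whole file into memory, translate, and write it back out
    # in the same encoding.
    text = read_text(src, encoding)
    with io.open(dst, "w", encoding=encoding) as f:
        f.write(text.translate(table))
```

This reads the entire file in one go, which is fine for anything up to 
a few dozen megabytes.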



-- 
Steven
