Re: fast regex

Tim Chase Thu, 06 May 2010 19:51:45 -0700

On 05/06/2010 09:11 PM, james_027 wrote:

for key, value in words_list.items():
     compile = re.compile(r"""\b%s\b""" % key, re.IGNORECASE)
     search = compile.sub(value, content)


where the content is a large text about 500,000 characters and the
word list is about 5,000

You don't specify what you want to do with "search" vs. "content"... are you then reassigning


  content = search

so that subsequent replacements happen? (your current version creates "search", only to discard it)
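
If chaining is what you're after, a minimal sketch of that loop might look like the following. The sample words_list and content are made up for illustration, and the re.escape() call is my own addition in case a key contains regex metacharacters:

```python
import re

# hypothetical sample data, not from the original post
words_list = {"cat": "dog", "red": "blue"}
content = "The Cat sat on the red mat."

for key, value in words_list.items():
    # re.escape() is an added precaution; the original interpolates the key raw
    pattern = re.compile(r"\b%s\b" % re.escape(key), re.IGNORECASE)
    # reassign so each replacement builds on the result of the previous one
    content = pattern.sub(value, content)
```

This still compiles one pattern and makes one pass over the text per key, so with about 5,000 keys it remains about 5,000 passes over 500,000 characters.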

My first thought would be to make use of re.sub()'s ability to take a function and do something like


  # a regexp that finds all possible
  # matches/words of interest
  r = re.compile(r'\b[a-zA-Z]+\b')
  def replacer(match):
    text = match.group(0)
    # assuming your dict.keys() are all lowercase:
    return words_list.get(text.lower(), text)
  results = r.sub(replacer, content)

This does a replacement for every word in the input corpus (possibly with itself), but only takes one pass through the source text. If you wanted to get really fancy (and didn't butt up against the max size for a regexp), I suppose you could do something like


  r = re.compile(r'\b(%s)\b' % (
    '|'.join(re.escape(s) for s in words_list.keys())),
    re.IGNORECASE)
  def replacer(match):
    return words_list[match.group(0).lower()] # assume lower keys
  results = r.sub(replacer, content)

which would only do replacements on your keys rather than every "word" in your input, but I'd start with the first version before abusing programmatic regexp generation.
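
To check that the two versions agree on a given input, here is a small self-contained sketch that runs both side by side. The sample data and the names word_re, keys_re, replace_any_word and replace_key are hypothetical, and it assumes the dict keys are all lowercase:

```python
import re

# hypothetical sample data; keys are assumed to be lowercase
words_list = {"cat": "dog", "red": "blue"}
content = "The Cat sat on the red mat, concatenated."

# version 1: match every alphabetic word and look it up in the dict
word_re = re.compile(r"\b[a-zA-Z]+\b")

def replace_any_word(match):
    text = match.group(0)
    # words that are not keys are replaced with themselves
    return words_list.get(text.lower(), text)

result_1 = word_re.sub(replace_any_word, content)

# version 2: a single alternation built from the escaped keys
keys_re = re.compile(
    r"\b(%s)\b" % "|".join(re.escape(s) for s in words_list),
    re.IGNORECASE)

def replace_key(match):
    # only keys can match, so a plain lookup is enough
    return words_list[match.group(0).lower()]

result_2 = keys_re.sub(replace_key, content)
```

Both make a single pass, so a replacement value that happens to be a key itself is not replaced again, unlike a loop that reassigns content after each key.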


-tkc





--
http://mail.python.org/mailman/listinfo/python-list
