On 05/06/2010 09:11 PM, james_027 wrote:
> for key, value in words_list.items():
>      compile = re.compile(r"""\b%s\b""" % key, re.IGNORECASE)
>      search = compile.sub(value, content)
>
> where the content is a large text about 500,000 characters and the
> word list is about 5,000

You don't specify what you want to do with "search" vs. "content". Are you then reassigning

  content = search

so that each replacement builds on the previous ones? (As posted, every iteration creates "search" only to discard it on the next pass.)
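
If reassignment is what you intended, a minimal sketch of that loop might look like the following. The function name replace_sequentially and the sample data are made up for illustration, and the re.escape() call is my addition (it assumes your keys are literal words, not patterns):

```python
import re

def replace_sequentially(content, words_list):
    # Hypothetical helper: one full pass over the text per key.
    for key, value in words_list.items():
        # re.escape() is an assumption: keys are treated as literal text.
        pattern = re.compile(r"\b%s\b" % re.escape(key), re.IGNORECASE)
        # Reassign so the next key sees the result of this substitution.
        # Note: a plain-string replacement has its backslashes interpreted.
        content = pattern.sub(value, content)
    return content

sample = replace_sequentially("The cat sat. CAT!", {"cat": "dog", "sat": "stood"})
```

Note that with this approach a later key can rewrite the output of an earlier one (e.g. {"a": "b", "b": "c"} turns "a" into "c"), and the text is scanned once per key, so 5,000 keys means 5,000 passes over 500,000 characters.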

My first thought would be to make use of re.sub()'s ability to take a function and do something like

  import re

  # a regexp that finds all possible
  # matches/words of interest
  r = re.compile(r'\b[a-zA-Z]+\b')
  def replacer(match):
    text = match.group(0)
    # assuming your dict.keys() are all lowercase;
    # words not in the dict are replaced with themselves
    return words_list.get(text.lower(), text)
  results = r.sub(replacer, content)

This does a replacement for every word in the input corpus (possibly with itself), but only takes one pass through the source text. If you wanted to get really fancy (and didn't butt up against the max size for a regexp), I suppose you could do something like

  # one big alternation of all the (escaped) keys
  r = re.compile(r'\b(%s)\b' % (
    '|'.join(re.escape(s) for s in words_list.keys())),
    re.IGNORECASE)
  def replacer(match):
    # assume lowercase keys
    return words_list[match.group(0).lower()]
  results = r.sub(replacer, content)

which would only do replacements on your keys rather than every "word" in your input, but I'd start with the first version before abusing programmatic regexp generation.
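
To compare the two single-pass variants on a toy input, something like the sketch below should do. The function names, the sample dict and the sample text are all invented for illustration; it assumes lowercase keys as above, and the empty-dict guard is my addition (an empty alternation would match the empty string and the dict lookup would then fail):

```python
import re

# Hypothetical sample data; keys are assumed to be lowercase.
words_list = {"cat": "dog", "sat": "stood"}
content = "The Cat sat on the mat. Concatenate!"

def replace_every_word(content, words_list):
    # Visit every alphabetic word; unknown words are replaced by themselves.
    word_re = re.compile(r"\b[a-zA-Z]+\b")
    def replacer(match):
        text = match.group(0)
        return words_list.get(text.lower(), text)
    return word_re.sub(replacer, content)

def replace_keys_only(content, words_list):
    # Guard (an assumption, not in the post): with no keys there is
    # nothing to replace, and r'\b()\b' would match empty strings.
    if not words_list:
        return content
    alternation = "|".join(re.escape(k) for k in words_list)
    keys_re = re.compile(r"\b(%s)\b" % alternation, re.IGNORECASE)
    def replacer(match):
        # Only keys can match, so a direct lookup is safe here.
        return words_list[match.group(0).lower()]
    return keys_re.sub(replacer, content)

every_word = replace_every_word(content, words_list)
keys_only = replace_keys_only(content, words_list)
```

Both calls make a single pass and leave "Concatenate" alone because of the \b anchors. One difference to keep in mind: the first variant can only ever match purely alphabetic words, so a key containing a hyphen or digit would never be found by it, while the second matches whatever the keys spell out.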

-tkc

--
http://mail.python.org/mailman/listinfo/python-list
