[your reply appears to have come only to me instead of the mailing list; CC'ing c.l.p in reply]

On 05/06/2010 10:12 PM, James Cai wrote:
When you say "This does a replacement for every word in the input corpus
(possibly with itself), but only takes one pass through the source text. "
It sounds great, but I am kinda lost with your code, sorry I am a regex
newbie.

calling statement

results = r.sub(replacer, content)

the replacer is a function that needs parameter. How does the replacer know
what parameter?

The documentation on the .sub() method says that the replacement can either be some text (as you used) or something callable (in my case a function, but could be an object with a __call__ method too), and that this callable is passed the match-object that's found.

why is r = re.compile(r'\b[a-zA-Z]+\b') when the words i want to find should
be in the word_list?

You don't detail what sorts of words are in your word_list.keys() but you'd want your pattern to match those. My first regexp (that you quote) matches single words, but rather loosely (thus my caveat about "replace[s] every word in the input"). The replacement function checks to see if the replacement is in your word-list, and does the replacement, otherwise, it just returns the input. To watch it in action, you can try this:

  d = { # keys are all lowercase
    'hello': 'goodbye',
    'world': 'Python',
    }

  def replacer(match):
    text = match.group(0)
    replacement = d.get(text.lower(), text)

    # see what we're doing for explanation purposes
    print "Replacing %r with %r" % (text, replacement)

    return replacement

  r = re.compile(r'\b[a-zA-Z]+\b')
  print r.sub(replacer, "Hello there world, this is a test")

Is the match here the match of regex or just a variable name?

If the keys in your word_list are more than just words, then the regexp may not find them all, and thus not replace them all. In that case you may have to resort to my 2nd regexp which builds the 5k branch regexp from your actual dictionary keys:

 r = re.compile(r'\b(%s)\b' % (
   '|'.join(re.escape(s) for s in words_list.keys())
   ),
   re.IGNORECASE)

This method on the above dictionary (modified)

  d = {
    'hello': 'goodbye',
    'world': 'python',
    'stuff with spaces?': 'tadah!',
    }

would create a regexp of

  \b(hello|world|stuff\ with\ spaces\?)\b

This has considerable performance implications as len(word_list) grows, unless you can figure a way to determine that some replacements are more probable than others and push them to the front of this regexp, but that's more complex and requires knowledge of your word-list.

However, if all your keys are simply alpha (or alphanumeric, or findable by a simple regexp; likely one that doesn't include whitespace), you'll likely get much better performance with a generic regexp that over-captures, tries to find a replacement in your dict, returning that as the replacement; or if it's not in the dict, returning the original text unchanged. My simple test would be:

  test_regex = r'\w+'
  r = re.compile(r'^\b%s\b$' % test_regex)
  # added the "^....$" to anchor for testing purposes

  for key in word_list: # keys by default
    if not r.match(key):
      print "Failed to match %r" % key
      break

If this passes, then the regexp should likely be sufficient to capture everything needed to use my replacer() function above.

-tkc


--
http://mail.python.org/mailman/listinfo/python-list

Reply via email to