Re: fast regex

Tim Chase Fri, 07 May 2010 05:38:14 -0700

[your reply appears to have come only to me instead of the mailing list; CC'ing c.l.p in reply]


On 05/06/2010 10:12 PM, James Cai wrote:

> When you say "This does a replacement for every word in the input corpus
> (possibly with itself), but only takes one pass through the source text. "
> It sounds great, but I am kinda lost with your code, sorry I am a regex
> newbie.
>
> calling statement
>
> results = r.sub(replacer, content)
>
> the replacer is a function that needs parameter. How does the replacer know
> what parameter?

The documentation on the .sub() method says that the replacement can either be some text (as you used) or something callable (in my case a function, but could be an object with a __call__ method too), and that this callable is passed the match-object that's found.
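A minimal sketch of the "object with a __call__ method" variant, written for Python 3. The class name CountingReplacer and its calls counter are made up for illustration; the only real API used is re.sub(), which passes each match object to the callable:

```python
import re

class CountingReplacer(object):
    """Hypothetical callable replacer; re.sub() calls it once per match."""

    def __init__(self, mapping):
        self.mapping = mapping  # keys are assumed to be lowercase
        self.calls = 0

    def __call__(self, match):
        # 'match' is the match object that re.sub() hands us
        self.calls += 1
        word = match.group(0)
        return self.mapping.get(word.lower(), word)

replacer = CountingReplacer({'hello': 'goodbye'})
result = re.sub(r'\b[a-zA-Z]+\b', replacer, "Hello there")
```

Here result ends up as "goodbye there", and replacer.calls is 2 because the callable ran once for each word matched.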

> why is r = re.compile(r'\b[a-zA-Z]+\b') when the words i want to find should
> be in the word_list?

You don't detail what sorts of words are in your word_list.keys(), but you'd want your pattern to match those. My first regexp (that you quote) matches single words, but rather loosely (thus my caveat about "replace[s] every word in the input"). The replacement function checks whether the matched word is in your word-list and, if so, returns its replacement; otherwise, it just returns the input. To watch it in action, you can try this:


  import re

  d = { # keys are all lowercase
    'hello': 'goodbye',
    'world': 'Python',
    }

  def replacer(match):
    text = match.group(0)
    # look the word up case-insensitively; fall back to the original text
    replacement = d.get(text.lower(), text)

    # see what we're doing for explanation purposes
    print("Replacing %r with %r" % (text, replacement))

    return replacement

  r = re.compile(r'\b[a-zA-Z]+\b')
  print(r.sub(replacer, "Hello there world, this is a test"))

> Is the match here the match of regex or just a variable name?

If the keys in your word_list are more than just words, then the regexp may not find them all, and thus not replace them all. In that case you may have to resort to my 2nd regexp which builds the 5k branch regexp from your actual dictionary keys:

 r = re.compile(r'\b(%s)\b' % (
   '|'.join(re.escape(s) for s in word_list.keys())
   ),
   re.IGNORECASE)


This method on the above dictionary (modified)

  d = {
    'hello': 'goodbye',
    'world': 'python',
    'stuff with spaces?': 'tadah!',
    }

would create a regexp of

  \b(hello|world|stuff\ with\ spaces\?)\b
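As a rough end-to-end sketch of this approach (Python 3; the sample sentences are made up, and the replacer assumes every key in word_list is lowercase so that the lowercased match can be looked up directly):

```python
import re

word_list = {
    'hello': 'goodbye',
    'world': 'python',
    'stuff with spaces?': 'tadah!',
}

# One alternation branch per key; re.escape() protects the space and "?"
pattern = r'\b(%s)\b' % '|'.join(re.escape(s) for s in word_list)
r = re.compile(pattern, re.IGNORECASE)

def replacer(match):
    # every match is one of the keys, modulo case
    return word_list[match.group(0).lower()]

result = r.sub(replacer, "Hello there World")
unchanged = r.sub(replacer, "stuff with spaces? yes")
```

Here result is "goodbye there python". Note that unchanged comes back identical to its input: the trailing \b needs a word character on one side, so a key ending in "?" is not matched when it is followed by a space. Keys that begin or end with punctuation may therefore need different anchoring than \b.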

This has considerable performance implications as len(word_list) grows, unless you can figure out a way to determine that some replacements are more probable than others and push them to the front of this regexp, but that's more complex and requires knowledge of your word-list.
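One possible way to sketch the "more probable first" idea, assuming you have a representative sample of text to count from. The names build_ordered_pattern and sample_text are hypothetical, and whether the reordering actually helps depends on the regex engine and your data; it has not been measured here:

```python
import re
from collections import Counter

def build_ordered_pattern(word_list, sample_text):
    # Count how often each simple word shows up in the sample.
    counts = Counter(w.lower() for w in re.findall(r'\w+', sample_text))
    # Most frequent keys first; keys never seen by this simple count
    # (including any multi-word keys) sort to the end.
    keys = sorted(word_list, key=lambda k: counts[k.lower()], reverse=True)
    return re.compile(
        r'\b(%s)\b' % '|'.join(re.escape(k) for k in keys),
        re.IGNORECASE)

word_list = {'hello': 'goodbye', 'world': 'python'}
r = build_ordered_pattern(word_list, "world world hello")
```

With this sample, "world" is counted twice and "hello" once, so the "world" branch is placed ahead of "hello" in the alternation.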

However, if all your keys are simply alpha (or alphanumeric, or findable by a simple regexp; likely one that doesn't include whitespace), you'll likely get much better performance with a generic regexp that over-captures, tries to find a replacement in your dict, returning that as the replacement; or if it's not in the dict, returning the original text unchanged. My simple test would be:


  test_regex = r'\w+'
  r = re.compile(r'^\b%s\b$' % test_regex)
  # added the "^....$" to anchor for testing purposes

  for key in word_list: # keys by default
    if not r.match(key):
      print("Failed to match %r" % key)
      break

If this passes, then the regexp should likely be sufficient to capture everything needed to use my replacer() function above.
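Putting the check and the replacement together, a small sketch in Python 3. The helper name find_unmatched_keys and the sample dictionaries are invented for illustration; it collects every failing key instead of stopping at the first one:

```python
import re

def find_unmatched_keys(word_list, test_regex):
    # Anchored with ^...$ so the whole key has to match the generic regexp
    r = re.compile(r'^\b%s\b$' % test_regex)
    return [key for key in word_list if not r.match(key)]

simple = {'hello': 'goodbye', 'world': 'Python'}
tricky = {'hello': 'goodbye', 'stuff with spaces?': 'tadah!'}

ok = find_unmatched_keys(simple, r'\w+')
bad = find_unmatched_keys(tricky, r'\w+')

def replacer(match):
    text = match.group(0)
    return simple.get(text.lower(), text)

# Only sensible when the check above found nothing to complain about
result = re.sub(r'\b\w+\b', replacer, "Hello there world, this is a test")
```

For the simple dictionary the check returns an empty list, so the generic regexp is enough; for the tricky one it reports 'stuff with spaces?', which suggests falling back to the alternation built from the keys.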


-tkc


--
http://mail.python.org/mailman/listinfo/python-list
