Tim Chase wrote:
> James wrote:
> > [Tim had written:]
> If the keys in your word_list are more than just words, then the
> regexp may not find them all, and thus not replace them all. In
> that case you may have to resort to my 2nd regexp which builds
> the 5k branch regexp from your actual dictionary keys:
>
>  >> r = re.compile(r'\b(%s)\b' % (
>  >>     '|'.join(re.escape(s) for s in words_list.keys())
>  >>     ),
>  >>     re.IGNORECASE)
>
> This method on the above dictionary (modified)
>
> d = {
>     'hello': 'goodbye',
>     'world': 'python',
>     'stuff with spaces?': 'tadah!',
>     }
>
> would create a regexp of
>
>     \b(hello|world|stuff\ with\ spaces\?)\b
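For completeness: that regexp is meant to drive a single re.sub() pass
over the text, with a callback that looks each match up in the
dictionary. A rough, untested sketch (assuming the keys are stored
lowercase, since the pattern is compiled with re.IGNORECASE, and using
word_list for the dict Tim calls d):

import re

word_list = {
    'hello': 'goodbye',
    'world': 'python',
    'stuff with spaces?': 'tadah!',
    }

r = re.compile(r'\b(%s)\b' % (
        '|'.join(re.escape(s) for s in word_list.keys())
        ),
        re.IGNORECASE)

# The callback maps whatever text actually matched back to its
# replacement, lower-casing it to recover the dictionary key.
print r.sub(lambda m: word_list[m.group(0).lower()], 'Hello, world!')
# prints: goodbye, python!

If any key is a prefix of another, the longest-first ordering
discussed below matters for this construction as well.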
There's a subtle issue with the order of the keywords. Python's
(ir)regular expressions do not strictly guarantee to find the longest
match. For example,

    re.findall(r'\b(about|about\-faced)\b', 'does about-faced match?')

returns ['about'], which is the first complete match, not the longest.
Sorting the keywords into longest-first order avoids the problem.

> This has considerable performance implications as len(word_list)
> grows, unless you can figure a way to determine that some
> replacements are more probable than others and push them to the
> front of this regexp, but that's more complex and requires
> knowledge of your word-list.

There is another method: factor out common prefixes, so the re
module's search-and-backtrack engine never has a choice of multiple
paths. Instead of:

    \b(abolished|abolition)\b

we'd use:

    \b(aboli(shed|tion))\b

The RE-generator is of course more complex than Tim's one-liner, but I
happen to have code, which I'll include below. It's probably academic,
as I'd agree with Tim that his first solution is the better candidate
at this point.

With my giant prefix-combined REs, I can re.compile one with 4000
words randomly chosen from a Unix words file, but 5000 results in
"regular expression code size limit exceeded". Tim's version, which
doesn't combine prefixes, tops out a little lower. This is on 32-bit
Windows, standard distribution. One could, of course, build multiple
smaller REs, but that starts to suck.

-Bryan Olson


from random import sample
import re


def build_big_re(str_list):
    """ Build a non-backtracking regular expression matching any of
        the words in str_list.
    """
    def trie_insert(table, word):
        # Insert the word one character at a time; an empty-string
        # key marks end-of-word.
        if word:
            submap = table.setdefault(word[0], {})
            trie_insert(submap, word[1:])
        else:
            table[""] = {}

    def sequentialize(table, lst):
        # Emit the trie as a regexp. Sorting keys longest-first puts
        # the end-of-word marker "" last, so longer words win.
        if table:
            keys = table.keys()
            keys.sort(key=len, reverse=True)
            lst.append('(?:')
            separator = ''
            for key in keys:
                lst.append(separator + re.escape(key))
                submap = table[key]
                # Collapse single-child chains into a plain string.
                while len(submap) == 1:
                    key = submap.keys()[0]
                    submap = submap[key]
                    lst.append(re.escape(key))
                sequentialize(submap, lst)
                separator = '|'
            lst.append(')')

    table = {}
    for word in str_list:
        trie_insert(table, word)
    lst = [r'\b']
    sequentialize(table, lst)
    lst.append(r'\b')
    re_str = "".join(lst)
    # print "RE is: '%s'\n" % re_str
    return re.compile(re_str, re.IGNORECASE)


def simple_re(str_list):
    # Tim's approach: one flat alternation, keywords sorted
    # longest-first so the longest candidate matches.
    str_list = sorted(str_list, key=len, reverse=True)
    re_str = r'\b(%s)\b' % (
        '|'.join(re.escape(s) for s in str_list))
    # print "RE is", re_str
    return re.compile(re_str, re.IGNORECASE)


def testy():
    words_file = r'/usr/share/dict/words'   # Common Unix
    # words_file = r'C:\z\wordlist.txt'     # Just my box
    nkeywords = 3500
    with open(words_file, 'r') as f:
        allwords = [w.strip() for w in f.readlines()]
    keywords = sample(allwords, nkeywords)
    bigtext = ' '.join(allwords)
    print 'Building simple RE...'
    simpre = simple_re(keywords)
    print 'Run simple re...'
    z1 = simpre.findall(bigtext)
    print 'Done.\n'
    print 'Building complex RE...'
    compre = build_big_re(keywords)
    print 'Run complex re...'
    z2 = compre.findall(bigtext)
    print 'Done.'
    assert z1 == z2


testy()

--
http://mail.python.org/mailman/listinfo/python-list