Fredrik Lundh wrote:
> Steven Bethard wrote:
>
>>>>I feel like there should be a simpler solution (maybe with the re
>>>>module?) but I can't figure one out. Any suggestions?
>>>
>>>using the finditer pattern I just posted in another thread:
>>>
>>>tokens = ['She', "'s", 'gon', 'na', 'write', 'a', 'book', '?']
>>>text = '''\
>>>She's gonna write
>>>a book?'''
>>>
>>>import re
>>>
>>>tokens.sort()     # lexical order
>>>tokens.reverse()  # look for longest match first
>>>pattern = "|".join(map(re.escape, tokens))
>>>pattern = re.compile(pattern)
>>>
>>>I get
>>>
>>>print [m.span() for m in pattern.finditer(text)]
>>>[(0, 3), (3, 5), (6, 9), (9, 11), (12, 17), (18, 19), (20, 24), (24, 25)]
>>>
>>>which seems to match your version pretty well.
>>
>>That's what I was looking for. Thanks!
>
> except that I misread your problem statement; the RE solution above allows the
> tokens to be specified in arbitrary order. if they're always ordered, you can
> replace the code with something like:
>
>     # match tokens plus optional whitespace between each token
>     pattern = "\s*".join("(" + re.escape(token) + ")" for token in tokens)
>     m = re.match(pattern, text)
>     result = (m.span(i+1) for i in range(len(tokens)))
>
> which is 6-7 times faster than the previous solution, on my machine.
Ahh yes, that's faster for me too. Thanks again!

STeVe
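For reference, here is a minimal Python 3 sketch of both approaches from the thread. The helper names are my own, the whitespace pattern uses a raw string, and the alternation version sorts tokens by length (descending) rather than reverse lexical order so that longer alternatives are always tried first; otherwise the logic follows the posted code.

    # Sketch of the two techniques: a general finditer() search and the
    # faster re.match() version for tokens known to appear in order.
    import re

    tokens = ['She', "'s", 'gon', 'na', 'write', 'a', 'book', '?']
    text = "She's gonna write\na book?"

    def spans_any_order(tokens, text):
        # Tokens may be given in any order; try longer alternatives first
        # so that e.g. a token like "gon" cannot shadow a longer one.
        alts = sorted(set(tokens), key=len, reverse=True)
        pattern = re.compile("|".join(map(re.escape, alts)))
        return [m.span() for m in pattern.finditer(text)]

    def spans_in_order(tokens, text):
        # Tokens are known to occur in order: one group per token,
        # separated by optional whitespace.
        pattern = r"\s*".join("(" + re.escape(t) + ")" for t in tokens)
        m = re.match(pattern, text)
        return [m.span(i + 1) for i in range(len(tokens))]

    print(spans_any_order(tokens, text))
    print(spans_in_order(tokens, text))
    # both print:
    # [(0, 3), (3, 5), (6, 9), (9, 11), (12, 17), (18, 19), (20, 24), (24, 25)]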