Paul McGuire wrote:
> "Steven Bethard" <[EMAIL PROTECTED]> wrote in message
> news:[EMAIL PROTECTED]
>
>>I've got a list of word substrings (the "tokens") which I need to align
>>to a string of text (the "sentence"). The sentence is basically the
>>concatenation of the token list, with spaces sometimes inserted between
>>tokens. I need to determine the start and end offsets of each token in
>>the sentence. For example::
>>
>>py> tokens = ['She', "'s", 'gon', 'na', 'write', 'a', 'book', '?']
>>py> text = '''\
>>... She's gonna write
>>... a book?'''
>>py> list(offsets(tokens, text))
>>[(0, 3), (3, 5), (6, 9), (9, 11), (12, 17), (18, 19), (20, 24), (24, 25)]
>
> ===================
> from pyparsing import oneOf
>
> tokens = ['She', "'s", 'gon', 'na', 'write', 'a', 'book', '?']
> text = '''\
> She's gonna write
> a book?'''
>
> tokenlist = oneOf( " ".join(tokens) )
> offsets = [(start,end) for token,start,end in tokenlist.scanString(text) ]
>
> print offsets
> ===================
> [(0, 3), (3, 5), (6, 9), (9, 11), (12, 17), (18, 19), (20, 24), (24, 25)]

Now that's a pretty solution. Three cheers for pyparsing! :)

STeVe
-- 
http://mail.python.org/mailman/listinfo/python-list
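
For comparison, here is a rough sketch of how the same alignment might be done without pyparsing, using only str.find. The name offsets() comes from the example in the original question, but the body below is an assumption and not code from this thread. It assumes the tokens occur in the sentence in order and are separated only by whitespace, and it raises ValueError otherwise.

```python
def offsets(tokens, text):
    """Yield (start, end) offsets of each token in text, scanning left to right."""
    pos = 0
    for token in tokens:
        # Look for the token at or after the end of the previous match.
        start = text.find(token, pos)
        if start == -1:
            raise ValueError("token %r not found after offset %d" % (token, pos))
        # Assumption: only whitespace may sit between consecutive tokens.
        if text[pos:start].strip():
            raise ValueError("unexpected text %r before token %r"
                             % (text[pos:start], token))
        end = start + len(token)
        yield start, end
        pos = end


tokens = ['She', "'s", 'gon', 'na', 'write', 'a', 'book', '?']
text = "She's gonna write\na book?"
print(list(offsets(tokens, text)))
# [(0, 3), (3, 5), (6, 9), (9, 11), (12, 17), (18, 19), (20, 24), (24, 25)]
```

Because each search starts where the previous token ended, a repeated token (two occurrences of 'a', say) is matched at successive positions and never twice at the same place.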