Steven Bethard wrote:
> I've got a list of word substrings (the "tokens") which I need to align
> to a string of text (the "sentence").  The sentence is basically the
> concatenation of the token list, with spaces sometimes inserted between
> tokens.  I need to determine the start and end offsets of each token in
> the sentence.  For example::
>
> py> tokens = ['She', "'s", 'gon', 'na', 'write', 'a', 'book', '?']
> py> text = '''\
> ... She's gonna write
> ... a book?'''
> py> list(offsets(tokens, text))
> [(0, 3), (3, 5), (6, 9), (9, 11), (12, 17), (18, 19), (20, 24), (24, 25)]
>
> Here's my current definition of the offsets function::
>
> py> def offsets(tokens, text):
> ...     start = 0
> ...     for token in tokens:
> ...         while text[start].isspace():
> ...             start += 1
> ...         text_token = text[start:start+len(token)]
> ...         assert text_token == token, (text_token, token)
> ...         yield start, start + len(token)
> ...         start += len(token)
> ...
>
> I feel like there should be a simpler solution (maybe with the re
> module?) but I can't figure one out.  Any suggestions?
>
> STeVe

Hi Steve:

Any reason you can't simply use str.find in your offsets function?

 >>> def offsets(tokens, text):
 ...     ptr = 0
 ...     for token in tokens:
 ...         fpos = text.find(token, ptr)
 ...         if fpos != -1:
 ...             end = fpos + len(token)
 ...             yield (fpos, end)
 ...             ptr = end
 ...
 >>> list(offsets(tokens, text))
 [(0, 3), (3, 5), (6, 9), (9, 11), (12, 17), (18, 19), (20, 24), (24, 25)]
 >>>

and then, for an entry in the wacky category, a difflib solution:

 >>> def offsets(tokens, text):
 ...     from difflib import SequenceMatcher
 ...     s = SequenceMatcher(None, text, "\t".join(tokens))
 ...     for start, _, length in s.get_matching_blocks():
 ...         if length:
 ...             yield start, start + length
 ...
 >>> list(offsets(tokens, text))
 [(0, 3), (3, 5), (6, 9), (9, 11), (12, 17), (18, 19), (20, 24), (24, 25)]
 >>>

cheers
Michael

-- 
http://mail.python.org/mailman/listinfo/python-list
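
Since the original question asked about the re module, here is a rough sketch of how that might look. It is not from either post above; it only reuses the offsets name and the tokens/text example from the thread. The pattern (optional whitespace followed by the escaped token) and the choice to raise ValueError on a mismatch are assumptions made for illustration. Unlike the str.find version, which silently skips a token it cannot find and would accept arbitrary text between tokens, this one insists that only whitespace separates consecutive tokens.

```python
import re

def offsets(tokens, text):
    """Yield (start, end) offsets of each token in text.

    Sketch only: assumes tokens appear in order and are separated
    by nothing but (optional) whitespace.
    """
    pos = 0
    for token in tokens:
        # Optional whitespace, then the literal token; re.escape keeps
        # characters such as '?' from being read as regex syntax.
        pattern = re.compile(r'\s*(%s)' % re.escape(token))
        # Pattern.match(text, pos) anchors the match at offset pos.
        m = pattern.match(text, pos)
        if m is None:
            raise ValueError('token %r not found at offset %d' % (token, pos))
        yield m.span(1)   # span of the token itself, excluding whitespace
        pos = m.end(1)

tokens = ['She', "'s", 'gon', 'na', 'write', 'a', 'book', '?']
text = "She's gonna write\na book?"
print(list(offsets(tokens, text)))
```

Matching one token at a time, rather than joining everything into one large pattern with a group per token, keeps the pattern small and avoids the limit on the number of groups that older versions of re imposed.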