I have a string with a bunch of whitespace in it, and a series of chunks of that string whose indices I need to find. However, the chunks have been whitespace-normalized, so that multiple spaces and newlines have been converted to single spaces as if by ' '.join(chunk.split()). Some example data to clarify my problem:
py> text = """\
... aaa bb ccc
... dd eee. fff gggg
... hh i.
... jjj kk.
... """
py> chunks = ['aaa bb', 'ccc dd eee.', 'fff gggg hh i.', 'jjj', 'kk.']

Note that the original "text" has a variety of whitespace between words, but the corresponding "chunks" have only single space characters between "words". I'm looking for the indices of each chunk, so for this example, I'd like:

py> result = [(3, 10), (11, 22), (24, 40), (44, 47), (48, 51)]

Note that the indices correspond to the *original* text so that the substrings in the given spans include the irregular whitespace:

py> for s, e in result:
...     print repr(text[s:e])
...
'aaa bb'
'ccc\ndd eee.'
'fff gggg\nhh i.'
'jjj'
'kk.'

I'm trying to write code to produce the indices. Here's what I have:

py> def get_indices(text, chunks):
...     chunks = iter(chunks)
...     chunk = None
...     for text_index, c in enumerate(text):
...         if c.isspace():
...             continue
...         if chunk is None:
...             chunk = chunks.next().replace(' ', '')
...             chunk_start = text_index
...             chunk_index = 0
...         if c != chunk[chunk_index]:
...             raise Exception('unmatched: %r %r' %
...                             (c, chunk[chunk_index]))
...         else:
...             chunk_index += 1
...             if chunk_index == len(chunk):
...                 yield chunk_start, text_index + 1
...                 chunk = None
...

And it appears to work:

py> list(get_indices(text, chunks))
[(3, 10), (11, 22), (24, 40), (44, 47), (48, 51)]
py> list(get_indices(text, chunks)) == result
True

But it seems somewhat inelegant. Can anyone see an easier/cleaner/more Pythonic way[1] of writing this code?

Thanks in advance,

STeVe

[1] Yes, I'm aware that these are subjective terms. I'm looking for subjectively "better" solutions. ;)

--
http://mail.python.org/mailman/listinfo/python-list
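[Editorial note: one common alternative to the character-by-character matcher above is to turn each normalized chunk into a regular expression in which every single space may match any run of whitespace. The sketch below is illustrative, not from the original post; the name get_indices_re and its error handling are hypothetical, and it assumes the chunks occur in the text in order.]

```python
import re

def get_indices_re(text, chunks):
    # Hypothetical regex-based variant (not the poster's code): for each
    # whitespace-normalized chunk, escape its words and join them with
    # \s+ so that a single space in the chunk matches any whitespace run
    # in the original text, then search left to right.
    pos = 0
    for chunk in chunks:
        pattern = re.compile(r'\s+'.join(re.escape(word)
                                         for word in chunk.split()))
        match = pattern.search(text, pos)
        if match is None:
            raise ValueError('unmatched chunk: %r' % (chunk,))
        yield match.start(), match.end()
        pos = match.end()  # the next chunk must start after this one

# Example with an ad-hoc two-chunk text (not the text from the post):
spans = list(get_indices_re("aaa  bb ccc\ndd eee.",
                            ['aaa bb', 'ccc dd eee.']))
# spans == [(0, 7), (8, 19)]
```

One behavioral difference worth noting: because this uses search rather than matching every non-space character, it would silently skip any stray non-whitespace text between chunks, whereas the generator above raises on the first mismatched character.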