Steven Bethard wrote:
> I have some plain text data and some SGML markup for that text that I
> need to align. (The SGML doesn't maintain the original whitespace, so I
> have to do some alignment; I can't just calculate the indices directly.)
> For example, some of my text looks like:
>
> TNF binding induces release of AIP1 (DAB2IP) from TNFR1, resulting in
> cytoplasmic translocation and concomitant formation of an intracellular
> signaling complex comprised of TRADD, RIP1, TRAF2, and AIPl.
>
> And the corresponding SGML looks like:
>
> <PROTEIN> TNF </PROTEIN> binding induces release of <PROTEIN> AIP1
> </PROTEIN> ( <PROTEIN> DAB2IP </PROTEIN> ) from <PROTEIN> TNFR1
> </PROTEIN> , resulting in cytoplasmic translocation and concomitant
> formation of an <PROTEIN> intracellular signaling complex </PROTEIN>
> comprised of <PROTEIN> TRADD </PROTEIN> , <PROTEIN> RIP1 </PROTEIN> ,
> <PROTEIN> TRAF2 </PROTEIN> , and AIPl .
>
> Note that the SGML inserts spaces not only within the SGML elements, but
> also around punctuation.
>
>
> I need to determine the indices in the original text that each SGML
> element corresponds to. Here's some working code to do this, based on a
> suggestion for a related problem by Fredrik Lundh[1]::
>
>     def align(text, sgml):
>         sgml = sgml.replace('&', '&amp;')
>         tree = etree.fromstring('<xml>%s</xml>' % sgml)
>         words = []
>         if tree.text is not None:
>             words.extend(tree.text.split())
>         word_indices = []
>         for elem in tree:
>             elem_words = elem.text.split()
>             start = len(words)
>             end = start + len(elem_words)
>             word_indices.append((start, end, elem.tag))
>             words.extend(elem_words)
>             if elem.tail is not None:
>                 words.extend(elem.tail.split())
>         expr = '\s*'.join('(%s)' % re.escape(word) for word in words)
>         match = re.match(expr, text)
>         assert match is not None
>         for word_start, word_end, label in word_indices:
>             start = match.start(word_start + 1)
>             end = match.end(word_end)
>             yield label, start, end
>
[...]
>     >>> list(align(text, sgml))
>     [('PROTEIN', 0, 3), ('PROTEIN', 31, 35), ('PROTEIN', 37, 43),
>     ('PROTEIN', 50, 55), ('PROTEIN', 128, 159), ('PROTEIN', 173, 178),
>     ('PROTEIN', 180, 184), ('PROTEIN', 186, 191)]
>
> The problem is, this doesn't work when my text is long (which it is)
> because regular expressions are limited to 100 groups. I get an error
> like::
[...]
Steve

This is probably an abuse of itertools...

---8<---
text = '''TNF binding induces release of AIP1 (DAB2IP) from TNFR1, resulting in
cytoplasmic translocation and concomitant formation of an intracellular
signaling complex comprised of TRADD, RIP1, TRAF2, and AIPl.'''

sgml = '''<PROTEIN> TNF </PROTEIN> binding induces release of <PROTEIN> AIP1
</PROTEIN> ( <PROTEIN> DAB2IP </PROTEIN> ) from <PROTEIN> TNFR1
</PROTEIN> , resulting in cytoplasmic translocation and concomitant
formation of an <PROTEIN> intracellular signaling complex </PROTEIN>
comprised of <PROTEIN> TRADD </PROTEIN> , <PROTEIN> RIP1 </PROTEIN> ,
<PROTEIN> TRAF2 </PROTEIN> , and AIPl .
'''

import itertools as it

def scan(line):
    # Group key: the tag name in front of the first '>' (None for an empty chunk).
    if not line:
        return
    line = line.strip()
    parts = line.split('>', 1)
    return parts[0]

def align(txt, sml):
    # Note: txt is never consulted; offsets are accumulated from the lengths
    # of the SGML chunks, including the spaces the SGML inserts.
    i = 0
    for k, g in it.groupby(sml.split('<'), scan):
        g = list(g)
        if not g[0]:
            continue
        text = g[0].split('>')[1]  #.replace('\n','')
        if k.startswith('/'):
            # Chunk after a closing tag: plain text between elements.
            i += len(text)
        else:
            # Chunk after an opening tag: the element's own text.
            offset = len(text.strip())
            yield k, i, i + offset
            i += offset

print(list(align(text, sgml)))
------------

[('PROTEIN', 0, 3), ('PROTEIN', 31, 35), ('PROTEIN', 38, 44),
('PROTEIN', 52, 57), ('PROTEIN', 131, 162), ('PROTEIN', 176, 181),
('PROTEIN', 184, 188), ('PROTEIN', 191, 196)]

It's off, possibly because of the punctuation; I can't figure it out.
Maybe you can tweak it?

hth

Gerard

-- 
http://mail.python.org/mailman/listinfo/python-list