Gerard Flanagan wrote:
> Steven Bethard wrote:
>> I have some plain text data and some SGML markup for that text that I
>> need to align. (The SGML doesn't maintain the original whitespace, so I
>> have to do some alignment; I can't just calculate the indices directly.)
>> For example, some of my text looks like:
>>
>> TNF binding induces release of AIP1 (DAB2IP) from TNFR1, resulting in
>> cytoplasmic translocation and concomitant formation of an intracellular
>> signaling complex comprised of TRADD, RIP1, TRAF2, and AIPl.
>>
>> And the corresponding SGML looks like:
>>
>> <PROTEIN> TNF </PROTEIN> binding induces release of <PROTEIN> AIP1
>> </PROTEIN> ( <PROTEIN> DAB2IP </PROTEIN> ) from <PROTEIN> TNFR1
>> </PROTEIN> , resulting in cytoplasmic translocation and concomitant
>> formation of an <PROTEIN> intracellular signaling complex </PROTEIN>
>> comprised of <PROTEIN> TRADD </PROTEIN> , <PROTEIN> RIP1 </PROTEIN> ,
>> <PROTEIN> TRAF2 </PROTEIN> , and AIPl .
>>
>> Note that the SGML inserts spaces not only within the SGML elements, but
>> also around punctuation.
>>
>>
>> I need to determine the indices in the original text that each SGML
>> element corresponds to.
>> Here's some working code to do this, based on a
>> suggestion for a related problem by Fredrik Lundh[1]::
>>
>>     def align(text, sgml):
>>         sgml = sgml.replace('&', '&amp;')
>>         tree = etree.fromstring('<xml>%s</xml>' % sgml)
>>         words = []
>>         if tree.text is not None:
>>             words.extend(tree.text.split())
>>         word_indices = []
>>         for elem in tree:
>>             elem_words = elem.text.split()
>>             start = len(words)
>>             end = start + len(elem_words)
>>             word_indices.append((start, end, elem.tag))
>>             words.extend(elem_words)
>>             if elem.tail is not None:
>>                 words.extend(elem.tail.split())
>>         expr = '\s*'.join('(%s)' % re.escape(word) for word in words)
>>         match = re.match(expr, text)
>>         assert match is not None
>>         for word_start, word_end, label in word_indices:
>>             start = match.start(word_start + 1)
>>             end = match.end(word_end)
>>             yield label, start, end
>>
> [...]
>>     >>> list(align(text, sgml))
>>     [('PROTEIN', 0, 3), ('PROTEIN', 31, 35), ('PROTEIN', 37, 43),
>>     ('PROTEIN', 50, 55), ('PROTEIN', 128, 159), ('PROTEIN', 173, 178),
>>     ('PROTEIN', 180, 184), ('PROTEIN', 186, 191)]
>>
>> The problem is, this doesn't work when my text is long (which it is)
>> because regular expressions are limited to 100 groups. I get an error
>> like::
> [...]
>
> Steve
>
> This is probably an abuse of itertools...
>
> ---8<---
> text = '''TNF binding induces release of AIP1 (DAB2IP) from
> TNFR1, resulting in cytoplasmic translocation and concomitant
> formation of an intracellular signaling complex comprised of TRADD,
> RIP1, TRAF2, and AIPl.'''
>
> sgml = '''<PROTEIN> TNF </PROTEIN> binding induces release of
> <PROTEIN> AIP1 </PROTEIN> ( <PROTEIN> DAB2IP </PROTEIN> ) from
> <PROTEIN> TNFR1 </PROTEIN> , resulting in cytoplasmic translocation
> and concomitant formation of an <PROTEIN> intracellular signaling
> complex </PROTEIN> comprised of <PROTEIN> TRADD </PROTEIN> ,
> <PROTEIN> RIP1 </PROTEIN> , <PROTEIN> TRAF2 </PROTEIN> , and AIPl .
> '''
>
> import itertools as it
> import string
>
> def scan(line):
>     if not line: return
>     line = line.strip()
>     parts = string.split(line, '>', maxsplit=1)
>     return parts[0]
>
> def align(txt,sml):
>     i = 0
>     for k,g in it.groupby(sml.split('<'),scan):
>         g = list(g)
>         if not g[0]: continue
>         text = g[0].split('>')[1]#.replace('\n','')
>         if k.startswith('/'):
>             i += len(text)
>         else:
>             offset = len(text.strip())
>             yield k, i, i+offset
>             i += offset
>
> print list(align(text,sgml))
>
> ------------
>
> [('PROTEIN', 0, 3), ('PROTEIN', 31, 35), ('PROTEIN', 38, 44),
> ('PROTEIN', 52, 57), ('PROTEIN', 131, 162), ('PROTEIN', 176, 181),
> ('PROTEIN', 184, 188), ('PROTEIN', 191, 196)]
>
> It's off because of the punctuation possibly, can't figure it out.
Thanks for taking a look. Yeah, the alignment's a big part of the problem. It'd be really nice if the thing that gives me SGML didn't add whitespace haphazardly. ;-)

STeVe
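
For what it's worth, one way to sidestep the 100-group limit might be to drop the big regular expression and walk the SGML tokens and the original text side by side, skipping whitespace in the text as you go. The following is only a rough sketch in Python 3, not code from either poster: the names `TOKEN` and `align_by_scanning` are made up for illustration, and it assumes the tags carry no attributes, the text contains no literal `<`, and whitespace is the only thing that differs between the two strings.

```python
import re

# Assumption: a tag is <NAME> or </NAME> with no attributes; anything else
# is a run of non-space characters taken from the original text.
TOKEN = re.compile(r'<(/?)(\w+)>|([^<\s]+)')

def align_by_scanning(text, sgml):
    """Yield (label, start, end) offsets into text for each SGML element."""
    pos = 0      # how far into text we have matched so far
    stack = []   # open elements as [label, start]; start is None until a word is seen
    for closing, label, word in TOKEN.findall(sgml):
        if word:
            # Whitespace is assumed to be the only difference, so skip it in text.
            while pos < len(text) and text[pos].isspace():
                pos += 1
            if not text.startswith(word, pos):
                raise ValueError('cannot align %r at offset %d' % (word, pos))
            # The first word after an opening tag fixes that element's start.
            for entry in stack:
                if entry[1] is None:
                    entry[1] = pos
            pos += len(word)
        elif closing:
            open_label, start = stack.pop()
            if open_label != label:
                raise ValueError('mismatched tag </%s>' % label)
            # An empty element gets a zero-width span at the current position.
            yield label, (pos if start is None else start), pos
        else:
            stack.append([label, None])

text = "TNF binding induces release of AIP1 (DAB2IP) from TNFR1,"
sgml = ("<PROTEIN> TNF </PROTEIN> binding induces release of <PROTEIN> AIP1 "
        "</PROTEIN> ( <PROTEIN> DAB2IP </PROTEIN> ) from <PROTEIN> TNFR1 </PROTEIN> ,")
print(list(align_by_scanning(text, sgml)))
# [('PROTEIN', 0, 3), ('PROTEIN', 31, 35), ('PROTEIN', 37, 43), ('PROTEIN', 50, 55)]
```

Because it only ever compares one word at a time, the length of the text should not matter, and a mismatch is reported at the offset where the two strings stop agreeing rather than as a failed match over the whole input.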