On Oct 17, 12:45 pm, Marc 'BlackJack' Rintsch <[EMAIL PROTECTED]> wrote:
> On Fri, 17 Oct 2008 11:42:05 -0400, Luis Zarrabeitia wrote:
> > I need to parse a file, text file. The format is something like that:
> >
> > TYPE1 metadata
> > data line 1
> > data line 2
> > ...
> > data line N
> > TYPE2 metadata
> > data line 1
> > ...
> > TYPE3 metadata
> > ...
> > […]
> > because when the parser iterates over the input, it can't know that it
> > finished processing the section until it reads the next "TYPE" line
> > (actually, until it reads the first line that it cannot parse, which if
> > everything went well, should be the 'TYPE'), but once it reads it, it is
> > no longer available to the outer loop. I wouldn't like to leak the
> > internals of the parsers to the outside.
> >
> > What could I do?
> > (to the curious: the format is a dialect of the E00 used in GIS)
>
> Group the lines before processing and feed each group to the right parser:
>
> import sys
> from itertools import groupby, imap
> from operator import itemgetter
>
> def parse_a(metadata, lines):
>     print 'parser a', metadata
>     for line in lines:
>         print 'a', line
>
> def parse_b(metadata, lines):
>     print 'parser b', metadata
>     for line in lines:
>         print 'b', line
>
> def parse_c(metadata, lines):
>     print 'parser c', metadata
>     for line in lines:
>         print 'c', line
>
> def test_for_type(line):
>     return line.startswith('TYPE')
>
> def parse(lines):
>     def tag():
>         type_line = None
>         for line in lines:
>             if test_for_type(line):
>                 type_line = line
>             else:
>                 yield (type_line, line)
>
>     type2parser = {'TYPE1': parse_a,
>                    'TYPE2': parse_b,
>                    'TYPE3': parse_c }
>
>     for type_line, group in groupby(tag(), itemgetter(0)):
>         type_id, metadata = type_line.split(' ', 1)
>         type2parser[type_id](metadata, imap(itemgetter(1), group))
>
> def main():
>     parse(sys.stdin)

I like groupby and find it very powerful, but I think it complicates
things here instead of simplifying them. I would instead create a
parser instance for every section as soon as the TYPE line is read and
then feed it one data line at a time (or, if all the data lines must or
should be given at once, append them in a list and feed them all as
soon as the next section is found), something like:

class parse_a(object):
    def __init__(self, metadata):
        print 'parser a', metadata
    def parse(self, line):
        print 'a', line

# similar for parse_b and parse_c
# ...

def parse(lines):
    parse = None
    for line in lines:
        if test_for_type(line):
            type_id, metadata = line.split(' ', 1)
            parse = type2parser[type_id](metadata).parse
        else:
            parse(line)

George