On Thu, 21 Apr 2005 15:37:03 -0600, Steven Bethard <[EMAIL PROTECTED]> wrote:
>I have a list of strings that looks something like: > ['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X'] >I need to group the strings into runs (lists) using the following rules >based on the string prefix: > 'O' is discarded > 'B_...' starts a new run > 'I_...' continues a run started by a 'B_...' >So, the example above should look like: > [['B_X'], ['B_Y', 'I_Y'], ['B_X', 'I_X'], ['B_X']] > >At the same time that I'm extracting the runs, it's important that I >check for errors as well. 'I_...' must always follow 'B_...', so errors >look like: > ['O', 'I_...'] > ['B_xxx', 'I_yyy'] >where 'I_...' either follows an 'O' or a 'B_...' where the suffix of the >'B_...' is different from the suffix of the 'I_...'. With error checks on predecessor relationship, I think I'd do the whole thing in a generator, doing my own groupby as I went. E.g., see if this does what you want (slightly different error checking): >>> L = ['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X'] >>> def get_runs(seq): ... subseq =[] ... curr = '<NO PRIOR ELEMENTS>' ... for latest in seq: ... curr, last = latest, curr ... if curr.startswith('B_'): ... if subseq: yield subseq ... subseq = [curr] ... elif curr.startswith('I_'): ... if (last[:2] not in ('B_', 'I_') or ... last[2:] != curr[2:] ... ): raise ValueError, '%r followed by %r'%(last, curr) ... subseq.append(curr) ... elif curr!='O': ... raise ValueError, 'Unrecognized element: %r' % curr ... if subseq: yield subseq ... >>> list(get_runs(L)) [['B_X'], ['B_Y', 'I_Y'], ['B_X', 'I_X'], ['B_X']] But note that I allowed multiple I_X, did you want to do that? >>> list(get_runs('B_X I_X I_X'.split())) [['B_X', 'I_X', 'I_X']] Did you want all these "errors" caught? >>> list(get_runs('B_X I_X ?_X'.split())) Traceback (most recent call last): File "<stdin>", line 1, in ? File "<stdin>", line 15, in get_runs ValueError: Unrecognized element: '?_X' >>> list(get_runs('I_X I_X ?_X'.split())) Traceback (most recent call last): File "<stdin>", line 1, in ? File "<stdin>", line 12, in get_runs ValueError: '<NO PRIOR ELEMENTS>' followed by 'I_X' >>> list(get_runs('B_X I_Y ?_X'.split())) Traceback (most recent call last): File "<stdin>", line 1, in ? File "<stdin>", line 12, in get_runs ValueError: 'B_X' followed by 'I_Y' Does that do what you want? (BTW, I added an error check against ['B_X', '*_X'] and such) > >This is the best I've come up with so far: > >py> class K(object): >... def __init__(self): >... self.last_result = False >... self.last_label = 'O' >... def __call__(self, label): >... if label[:2] in ('O', 'B_'): >... self.last_result = not self.last_result Thanks for nice reminder that groupby doesn't need distinct keys for all groups, just different from last ;-) >... elif self.last_label[2:] != label[2:]: >... raise ValueError('%s followed by %s' % >... (self.last_label, label)) >... self.last_label = label >... return self.last_result >... >py> def get_runs(lst): >... for _, item in itertools.groupby(lst, K()): >... result = list(item) >... if result != ['O']: >... yield result >... >py> list(get_runs(['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X'])) >[['B_X'], ['B_Y', 'I_Y'], ['B_X', 'I_X'], ['B_X']] >py> list(get_runs(['O', 'I_Y'])) >Traceback (most recent call last): > ... >ValueError: O followed by I_Y >py> list(get_runs(['B_X', 'I_Y'])) >Traceback (most recent call last): > ... >ValueError: B_X followed by I_Y > >Can anyone see another way to do this? > Well, there's no end to "another way", but above is my .02USD ;-) Regards, Bengt Richter -- http://mail.python.org/mailman/listinfo/python-list