['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']
I need to group the strings into runs (lists) using the following rules based on the string prefix:
'O' is discarded
'B_...' starts a new run
'I_...' continues a run started by a 'B_...'
So, the example above should look like:
[['B_X'], ['B_Y', 'I_Y'], ['B_X', 'I_X'], ['B_X']]
At the same time that I'm extracting the runs, it's important that I check for errors as well. 'I_...' must always follow 'B_...', so errors look like:
['O', 'I_...']
['B_xxx', 'I_yyy']
where 'I_...' either follows an 'O' or a 'B_...' where the suffix of the 'B_...' is different from the suffix of the 'I_...'.
This is the best I've come up with so far:
py> import itertools
py> class K(object):
...     def __init__(self):
...         self.last_result = False
...         self.last_label = 'O'
...     def __call__(self, label):
...         if label[:2] in ('O', 'B_'):
...             self.last_result = not self.last_result
...         elif self.last_label[2:] != label[2:]:
...             raise ValueError('%s followed by %s' %
...                              (self.last_label, label))
...         self.last_label = label
...         return self.last_result
...
py> def get_runs(lst):
...     for _, item in itertools.groupby(lst, K()):
...         result = list(item)
...         if result != ['O']:
...             yield result
...
py> list(get_runs(['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']))
[['B_X'], ['B_Y', 'I_Y'], ['B_X', 'I_X'], ['B_X']]
py> list(get_runs(['O', 'I_Y']))
Traceback (most recent call last):
  ...
ValueError: O followed by I_Y
py> list(get_runs(['B_X', 'I_Y']))
Traceback (most recent call last):
  ...
ValueError: B_X followed by I_Y
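For comparison, the same rules can probably also be written as a plain loop without groupby. This is only a rough sketch: the name get_runs_loop is made up for illustration, and the extra check that rejects labels other than 'O', 'B_...' and 'I_...' is an assumption that goes beyond the rules stated above.

```python
def get_runs_loop(labels):
    """Group 'B_...'/'I_...' labels into runs, discarding 'O'.

    Hypothetical explicit-loop variant of the grouping rules.
    """
    runs = []
    last_label = 'O'  # behave as if an 'O' came before the first label
    for label in labels:
        if label.startswith('B_'):
            # 'B_...' always starts a new run
            runs.append([label])
        elif label.startswith('I_'):
            # 'I_...' must follow a 'B_...' or 'I_...' with the same suffix
            if last_label == 'O' or last_label[2:] != label[2:]:
                raise ValueError('%s followed by %s' % (last_label, label))
            runs[-1].append(label)
        elif label != 'O':
            # assumption: any other label is treated as an error
            raise ValueError('unexpected label %r' % (label,))
        last_label = label
    return runs
```

Called on the example list from the top of this message, it returns the same four runs as get_runs, and it raises ValueError with the same messages for ['O', 'I_Y'] and ['B_X', 'I_Y']. It returns a list rather than a generator, so nothing is yielded before an error is detected.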
Can anyone see another way to do this?
STeVe