grouping subsequences with BIO tags

Steven Bethard Thu, 21 Apr 2005 14:40:05 -0700

I have a list of strings that looks something like:

    ['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']

I need to group the strings into runs (lists) using the following rules based on the string prefix:

    'O' is discarded
    'B_...' starts a new run
    'I_...' continues a run started by a 'B_...'

So, the example above should look like:

    [['B_X'], ['B_Y', 'I_Y'], ['B_X', 'I_X'], ['B_X']]

While extracting the runs, I also need to check for errors. An 'I_...' must always continue a run started by a 'B_...' with the same suffix, so errors look like:

    ['O', 'I_...']
    ['B_xxx', 'I_yyy']

where the 'I_...' follows either an 'O' or a 'B_...' whose suffix differs from the suffix of the 'I_...'.
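To make the rules concrete, here is a minimal sketch of them as a plain loop. The function name `group_bio` and the variable names are hypothetical, chosen only for illustration, and labels with any other prefix are assumed to be errors.

```python
def group_bio(labels):
    """Group BIO labels into runs; raise ValueError on a misplaced 'I_' label."""
    runs = []
    current = None      # the run being built, or None when outside any run
    previous = 'O'      # treat the start of the list like an 'O'
    for label in labels:
        if label == 'O':
            current = None                  # 'O' is discarded and ends any run
        elif label.startswith('B_'):
            current = [label]               # 'B_...' starts a new run
            runs.append(current)
        elif label.startswith('I_'):
            # 'I_...' must continue a run whose suffix matches its own
            if current is None or previous[2:] != label[2:]:
                raise ValueError('%s followed by %s' % (previous, label))
            current.append(label)
        else:
            # assumption: anything that is not O, B_ or I_ is invalid
            raise ValueError('unrecognized label %s' % label)
        previous = label
    return runs
```

Called on the example list, this returns the four runs shown above; called on `['O', 'I_Y']` or `['B_X', 'I_Y']`, it raises ValueError.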

This is the best I've come up with so far:

py> import itertools
py> class K(object):
...     def __init__(self):
...         self.last_result = False
...         self.last_label = 'O'
...     def __call__(self, label):
...         if label[:2] in ('O', 'B_'):
...             self.last_result = not self.last_result
...         elif self.last_label[2:] != label[2:]:
...             raise ValueError('%s followed by %s' %
...                              (self.last_label, label))
...         self.last_label = label
...         return self.last_result
...
py> def get_runs(lst):
...     for _, item in itertools.groupby(lst, K()):
...         result = list(item)
...         if result != ['O']:
...             yield result
...
py> list(get_runs(['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']))
[['B_X'], ['B_Y', 'I_Y'], ['B_X', 'I_X'], ['B_X']]
py> list(get_runs(['O', 'I_Y']))
Traceback (most recent call last):
  ...
ValueError: O followed by I_Y
py> list(get_runs(['B_X', 'I_Y']))
Traceback (most recent call last):
  ...
ValueError: B_X followed by I_Y
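The grouping works because groupby starts a new group whenever the key value changes: K flips its return value on every 'O' or 'B_...' label and repeats it on every 'I_...' label. A small sketch that prints the key for each label, with the class repeated so the snippet runs on its own (the names `labels`, `key` and `keys` are just for illustration):

```python
class K(object):
    def __init__(self):
        self.last_result = False
        self.last_label = 'O'

    def __call__(self, label):
        if label[:2] in ('O', 'B_'):
            # a new 'O' or 'B_' flips the key, which forces a new group
            self.last_result = not self.last_result
        elif self.last_label[2:] != label[2:]:
            raise ValueError('%s followed by %s' %
                             (self.last_label, label))
        self.last_label = label
        return self.last_result


labels = ['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']
key = K()
keys = [key(label) for label in labels]

# show each label next to the key that groupby would see for it
for label, k in zip(labels, keys):
    print('%-4s %s' % (label, k))
```

The keys come out as True, False, True, True, False, True, True, False, so the only adjacent equal keys are those of a 'B_...' and the 'I_...' labels that follow it.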

Can anyone see another way to do this?

STeVe
--
http://mail.python.org/mailman/listinfo/python-list
