So, I have a list of lists, where every item in each sublist is an (item, label) tuple. It looks something like:

py> data = [[('a', 0),
...          ('b', 1),
...          ('c', 2)],
...
...         [('d', 2),
...          ('e', 0)],
...
...         [('f', 0),
...          ('g', 2),
...          ('h', 1),
...          ('i', 0),
...          ('j', 0)]]

Now, I'd like to sample down the number of items in each sublist in the following manner. I need to count the occurrences of each 'label' (the second item in each tuple) across all the items of all the sublists, and then randomly remove items until every label occurs equally often, that is, as often as the rarest label. So, given the data above, one possible resampling would be:

    [[('b', 1),
      ('c', 2)],

     [('e', 0)],

     [('g', 2),
      ('h', 1),
      ('i', 0)]]
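
To make the target concrete, here is a minimal sketch of just the counting step. The helper name `label_counts` is made up for illustration, and it assumes every item is an (item, label) tuple as in the data above:

```python
from collections import Counter

data = [[('a', 0), ('b', 1), ('c', 2)],
        [('d', 2), ('e', 0)],
        [('f', 0), ('g', 2), ('h', 1), ('i', 0), ('j', 0)]]

def label_counts(groups):
    # hypothetical helper: count how often each label occurs across all sublists
    return Counter(label for group in groups for _, label in group)

counts = label_counts(data)
target = min(counts.values())  # each label is sampled down to this many items
print(sorted(counts.items()))  # [(0, 5), (1, 2), (2, 3)]
print(target)                  # 2
```

Label 1 is the rarest with 2 occurrences, so labels 0 and 2 each have to lose items until they also occur twice.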

Note that the resampled list has exactly 2 examples of each label, because the rarest label (1) occurs only twice in the original data. I have code that does this, but it's a little complicated:

py> import random
py> def resample(data):
...     # determine which indices are associated with each label
...     label_indices = {}
...     for i, group in enumerate(data):
...         for j, (item, label) in enumerate(group):
...             label_indices.setdefault(label, []).append((i, j))
...     # sample each set of indices down
...     min_count = min(len(indices)
...                     for indices in label_indices.values())
...     for label, indices in label_indices.items():
...         label_indices[label] = random.sample(indices, min_count)
...     # return the resampled data
...     return [[(item, label)
...              for j, (item, label) in enumerate(group)
...              if (i, j) in label_indices[label]]
...             for i, group in enumerate(data)]
...
py>
py> resample(data)
[[('b', 1), ('c', 2)], [('d', 2), ('e', 0)], [('h', 1), ('i', 0)]]
py> resample(data)
[[('b', 1), ('c', 2)], [('d', 2)], [('f', 0), ('h', 1), ('j', 0)]]
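
Since the output is random, a quick sanity check is to confirm that every label ends up with the same count. This is only a sketch: `is_balanced` is a name invented here, and `result` is simply the first output shown above pasted in by hand:

```python
from collections import Counter

def is_balanced(groups):
    # hypothetical check: True when all labels occur equally often
    counts = Counter(label for group in groups for _, label in group)
    return len(set(counts.values())) <= 1

result = [[('b', 1), ('c', 2)], [('d', 2), ('e', 0)], [('h', 1), ('i', 0)]]
print(is_balanced(result))  # True
```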

Can anyone see a simpler way of writing resample()?

Steve
--
http://mail.python.org/mailman/listinfo/python-list
