py> data = [[('a', 0),
...          ('b', 1),
...          ('c', 2)],
...
...         [('d', 2),
...          ('e', 0)],
...
...         [('f', 0),
...          ('g', 2),
...          ('h', 1),
...          ('i', 0),
...          ('j', 0)]]
Now, I'd like to sample down the number of items in each sublist in the following manner. I need to count the occurrences of each 'label' (the second item in each tuple) in all the items of all the sublists, and randomly remove some items until the number of occurrences of each 'label' is equal. So, given the data above, one possible resampling would be:
[[('b', 1), ('c', 2)],
[('e', 0)],
[('g', 2), ('h', 1), ('i', 0)]]
Note that there are now only 2 examples of each label. I have code that does this, but it's a little complicated:
py> import random
py> def resample(data):
...     # determine which indices are associated with each label
...     label_indices = {}
...     for i, group in enumerate(data):
...         for j, (item, label) in enumerate(group):
...             label_indices.setdefault(label, []).append((i, j))
...     # sample each set of indices down
...     min_count = min(len(indices)
...                     for indices in label_indices.itervalues())
...     for label, indices in label_indices.iteritems():
...         label_indices[label] = random.sample(indices, min_count)
...     # return the resampled data
...     return [[(item, label)
...              for j, (item, label) in enumerate(group)
...              if (i, j) in label_indices[label]]
...             for i, group in enumerate(data)]
...
py> resample(data)
[[('b', 1), ('c', 2)], [('d', 2), ('e', 0)], [('h', 1), ('i', 0)]]
py> resample(data)
[[('b', 1), ('c', 2)], [('d', 2)], [('f', 0), ('h', 1), ('j', 0)]]
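To make the target condition concrete, here is a minimal sketch of the label counting and of a check that a result is balanced. The helper name label_counts is only for illustration, and the sketch assumes a Python version that provides collections.Counter (2.7 or later, including 3.x). It does no resampling itself; it only verifies the hand-written example from earlier in this post.

```python
from collections import Counter

def label_counts(data):
    """Count how often each label occurs across all sublists."""
    return Counter(label for group in data for _item, label in group)

data = [[('a', 0), ('b', 1), ('c', 2)],
        [('d', 2), ('e', 0)],
        [('f', 0), ('g', 2), ('h', 1), ('i', 0), ('j', 0)]]

counts = label_counts(data)
# Labels 0, 1 and 2 occur 5, 2 and 3 times, so each is sampled down to 2.
target = min(counts.values())

resampled = [[('b', 1), ('c', 2)],
             [('e', 0)],
             [('g', 2), ('h', 1), ('i', 0)]]
# A valid resampling leaves every label with exactly `target` items.
assert all(n == target for n in label_counts(resampled).values())
```

The same check can be applied to the output of resample(data) on any run, since the random choice changes which items survive but not how many per label.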
Can anyone see a simpler way of doing this?
Steve