I have a list of dictionaries. Each dictionary holds counts of various 'words', e.g.:

py> countdicts = [
...     dict(a=9, b=9, c=9),
...     dict(a=8, b=7),
...     dict(a=4, b=5, c=12)]

I need to select dicts under the constraint that the total count of each 'word', summed over all selected dicts, doesn't exceed a given MAX_VALUE. Right now, I do this by:

py> def select(dicts, n):
...     result = []
...     counts = {}
...     def doadd(d):
...         # can d be added without pushing any word's total over n?
...         for label, count in d.items():
...             if counts.get(label, 0) + count > n:
...                 return False
...         return True
...     for d in dicts:
...         if doadd(d):
...             result.append(d)
...             for label, count in d.items():
...                 counts[label] = counts.get(label, 0) + count
...     return result
...
...

Basically, I use a greedy approach -- adding each dict if I can. This can produce suboptimal selections: while the total counts must not exceed MAX_VALUE, I'd like them to end up as close to MAX_VALUE as possible. Here's an example of data on which my select function produces such a suboptimal solution:

py> countdicts = [
...     dict(a=9, b=9, c=9),
...     dict(a=8, b=7),
...     dict(a=4, b=5, c=12)]
py> select(countdicts, 12)
[{'a': 9, 'c': 9, 'b': 9}]

The better solution would have been to select the second two dicts -- then every 'word' would have a total count of exactly 12.
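As a quick sanity check (not part of the original post), collections.Counter confirms that totalling the second two dicts gives every word a count of exactly 12:

```python
from collections import Counter

# Totals over the second and third dicts from the example above.
totals = Counter()
for d in [dict(a=8, b=7), dict(a=4, b=5, c=12)]:
    totals.update(d)

# Every word ('a', 'b', 'c') totals exactly 12 -- right at MAX_VALUE.
```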

I don't need an optimal solution; in fact, the solution I have right now is tolerable. But if anyone knows a simple way to get the totals closer to MAX_VALUE, I'd appreciate it. Thanks!
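[Editor's note: this is a multidimensional-knapsack-style problem, so exact answers get expensive fast, but for short lists an exhaustive search over subsets is a simple way to see what "optimal" looks like. A sketch (the name `select_exact` and the objective -- maximize the grand total of all counts -- are assumptions, not from the post):]

```python
from itertools import combinations
from collections import Counter

def select_exact(dicts, n):
    # Try every subset of dicts; keep the feasible one (no word's
    # total exceeds n) with the largest overall total. Exponential
    # in len(dicts), so only practical for short lists.
    best, best_total = [], -1
    for r in range(len(dicts) + 1):
        for subset in combinations(dicts, r):
            totals = Counter()
            for d in subset:
                totals.update(d)
            if all(count <= n for count in totals.values()):
                total = sum(totals.values())
                if total > best_total:
                    best, best_total = list(subset), total
    return best
```

On the example data, select_exact(countdicts, 12) picks the second two dicts, matching the better solution described above.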

STeVe

P.S. No, I'm not just making these problems up! I'm sick, but not that sick. ;) It has to do with resampling a set of data... If you're really interested, I can provide the full details, but they'll only make the problem even _more_ complicated. =)
--
http://mail.python.org/mailman/listinfo/python-list