Steven Bethard wrote:
I have a list of dictionaries. Each dictionary holds counts of
various 'words'...
...
Basically, I use a greedy approach -- adding a dict each time if I
can.
... This leads to some suboptimal solutions given that, while the total
counts must not exceed MAX_VALUE, I'd like them to be as close to
MAX_VALUE as possible. An example of data on which my select
function produces such a suboptimal solution:
py> countdicts = [ ... dict(a=9, b=9, c=9), ... dict(a=8, b=7), ... dict(a=4, b=5, c=12)] py> select(countdicts, 12) [{'a': 9, 'c': 9, 'b': 9}]
The better solution would have been to select the second two dicts -- then all 'words' would have count 12.
I don't need an optimal solution; in fact, the solution I have right now is tolerable. But if anyone knows a simple way to improve my performance, I'd appreciate it. Thanks!
You should specify a function that assigns a score to valid solutions; it's not apparent from the wording above and the example exactly what that function might be. Thinking through the alternatives might help you solve the problem yourself :-)
Yeah, I guess that's a good idea. Here's a scoring function that should give the right range of scores (where 1.0 is perfect and 0.0 is failure):
py> from __future__ import division py> def score(origdicts, selecteddicts, n): ... origwords = set(word for d in origdicts for word in d) ... counts = {} ... for d in selecteddicts: ... for word, count in d.iteritems(): ... counts[word] = counts.get(word, 0) + count ... if counts[word] > n: ... return 0.0 ... return sum(counts.itervalues())/len(origwords)/n ... py> countdicts = [ ... dict(a=9, b=9, c=9), ... dict(a=8, b=7), ... dict(a=4, b=5, c=12)] py> score(countdicts, countdicts, 12) 0.0 py> score(countdicts, select(countdicts, 12), 12) 0.75 py> score(countdicts, [dict(a=8, b=7), dict(a=4, b=5, c=12)], 12) 1.0
It's pretty easy to build a configuration that can't be solved with a perfect score:
py> def subsets(lst):
... if not lst:
... yield []
... else:
... first = lst[:1]
... for subset in subsets(lst[1:]):
... yield subset
... yield first + subset
...
py> countdicts = [
... dict(a=1, b=2, c=3),
... dict(a=4, b=5, c=6),
... dict(a=7, b=8, c=9)]
py> for dicts in subsets(countdicts):
... print score(countdicts, dicts, 9), dicts
...
0.0 []
0.222222222222 [{'a': 1, 'c': 3, 'b': 2}]
0.555555555556 [{'a': 4, 'c': 6, 'b': 5}]
0.777777777778 [{'a': 1, 'c': 3, 'b': 2}, {'a': 4, 'c': 6, 'b': 5}]
0.888888888889 [{'a': 7, 'c': 9, 'b': 8}]
0.0 [{'a': 1, 'c': 3, 'b': 2}, {'a': 7, 'c': 9, 'b': 8}]
0.0 [{'a': 4, 'c': 6, 'b': 5}, {'a': 7, 'c': 9, 'b': 8}]
0.0 [{'a': 1, 'c': 3, 'b': 2}, {'a': 4, 'c': 6, 'b': 5}, {'a': 7, 'c': 9, 'b': 8}]
Hmmm... I can't do an exhaustive search like this one here -- I have ~8000 dicts, so 2**N is definitely too large of a space to search. Now that I have a scoring routine maybe I could do an A* search though...
Thanks for the input!
STeVe -- http://mail.python.org/mailman/listinfo/python-list