John Machin wrote:
Steven Bethard wrote:

I have a list of dictionaries. Each dictionary holds counts of
various 'words'...
...
Basically, I use a greedy approach -- adding a dict each time if I
can.
... This leads to some suboptimal solutions given that, while the total
counts must not exceed MAX_VALUE, I'd like them to be as close to
MAX_VALUE as possible. An example of data on which my select
function produces such a suboptimal solution:


py> countdicts = [
...     dict(a=9, b=9, c=9),
...     dict(a=8, b=7),
...     dict(a=4, b=5, c=12)]
py> select(countdicts, 12)
[{'a': 9, 'c': 9, 'b': 9}]

The better solution would have been to select the second two dicts --
then all 'words' would have count 12.

I don't need an optimal solution; in fact, the solution I have right
now is tolerable.  But if anyone knows a simple way to improve my
performance, I'd appreciate it.  Thanks!

You should specify a function that assigns a score to valid solutions; it's not apparent from the wording above and the example exactly what that function might be. Thinking through the alternatives might help you solve the problem yourself :-)

Yeah, I guess that's a good idea. Here's a scoring function that should give the right range of scores (where 1.0 is perfect and 0.0 is failure):


py> from __future__ import division
py> def score(origdicts, selecteddicts, n):
...     origwords = set(word for d in origdicts for word in d)
...     counts = {}
...     for d in selecteddicts:
...         for word, count in d.iteritems():
...             counts[word] = counts.get(word, 0) + count
...             if counts[word] > n:
...                 return 0.0
...     return sum(counts.itervalues())/len(origwords)/n
...
py> countdicts = [
...     dict(a=9, b=9, c=9),
...     dict(a=8, b=7),
...     dict(a=4, b=5, c=12)]
py> score(countdicts, countdicts, 12)
0.0
py> score(countdicts, select(countdicts, 12), 12)
0.75
py> score(countdicts, [dict(a=8, b=7), dict(a=4, b=5, c=12)], 12)
1.0

It's pretty easy to build a configuration that can't be solved with a perfect score:

py> def subsets(lst):
... if not lst:
... yield []
... else:
... first = lst[:1]
... for subset in subsets(lst[1:]):
... yield subset
... yield first + subset
...
py> countdicts = [
... dict(a=1, b=2, c=3),
... dict(a=4, b=5, c=6),
... dict(a=7, b=8, c=9)]
py> for dicts in subsets(countdicts):
... print score(countdicts, dicts, 9), dicts
...
0.0 []
0.222222222222 [{'a': 1, 'c': 3, 'b': 2}]
0.555555555556 [{'a': 4, 'c': 6, 'b': 5}]
0.777777777778 [{'a': 1, 'c': 3, 'b': 2}, {'a': 4, 'c': 6, 'b': 5}]
0.888888888889 [{'a': 7, 'c': 9, 'b': 8}]
0.0 [{'a': 1, 'c': 3, 'b': 2}, {'a': 7, 'c': 9, 'b': 8}]
0.0 [{'a': 4, 'c': 6, 'b': 5}, {'a': 7, 'c': 9, 'b': 8}]
0.0 [{'a': 1, 'c': 3, 'b': 2}, {'a': 4, 'c': 6, 'b': 5}, {'a': 7, 'c': 9, 'b': 8}]


Hmmm... I can't do an exhaustive search like this one here -- I have ~8000 dicts, so 2**N is definitely too large of a space to search. Now that I have a scoring routine maybe I could do an A* search though...

Thanks for the input!

STeVe
--
http://mail.python.org/mailman/listinfo/python-list

Reply via email to