> On Apr 15, 2018, at 5:44 PM, Peter Norvig <[email protected]> wrote:
>
> If you think of a Counter as a multiset, then it should support __or__, not
> __add__, right?
FWIW, Counter is explicitly documented to support the four multiset-style
mathematical operations discussed in Knuth TAOCP Volume II section 4.6.3
exercise 19:
>>> c = Counter(a=3, b=1)
>>> d = Counter(a=1, b=2)
>>> c + d # add two counters together: c[x] + d[x]
Counter({'a': 4, 'b': 3})
>>> c - d # saturating subtraction (keeping only positive
>>> counts)
Counter({'a': 2})
>>> c & d # intersection: min(c[x], d[x])
Counter({'a': 1, 'b': 1})
>>> c | d # union: max(c[x], d[x])
Counter({'a': 3, 'b': 2})
The wikipedia article on Multisets lists a further operation, inclusion, that
is not currently supported:
https://en.wikipedia.org/wiki/Multiset#Basic_properties_and_operations
> I do think it would have been fine if Counter did not support "+" at all
> (and/or if Counter was limited to integer values). But given where we are
> now, it feels like we should preserve `c + c == 2 * c`.
The + operation has legitimate use cases (it is perfectly reasonable to want to
combine the results two separate counts). And, as you pointed out, it is what
we already have and cannot change :-)
So, the API design issue that confronts us is that it would be a bit weird and
disorienting for the arithmetic operators to have two different signatures:
<counter> += <counter>
<counter> -= <counter>
<counter> *= <scalar>
<counter> /= <scalar>
Also, we should respect the comments given by others on the tracker issue. In
particular, there is a preference to not have an in-place operation and only
allow a new counter instance to be created. That will help people avoid data
structure modality problems:
.
c[category] += 1 # Makes sense during the frequency counting or
accumulation phase
c /= c.total # Covert to a probability mass function
c[category] += 1 # This code looks correct but no longer makes any
sense
> As to the "doesn't really add any new capabilities" argument, that's true,
> but it is also true for Counter as a whole: it doesn't add much over
> defaultdict(int), but it is certainly convenient to have a standard way to do
> what it does.
IIRC, the defaultdict(int) in your first version triggered a bug because the
model inadvertently changed during the analysis phase rather than being frozen
after the training phase. The Counter doesn't suffer from the same issue
(modifying the dict on a failed lookup). Also, the Counter class does have a
few value added features: Counter(iterable), c.most_common(), c.elements(),
etc. But yes, at its heart the counter is mostly just a specialized
dictionary. The thought I was trying to express is that suggestions to build
out Counter API are a little less compelling when we already have a way to do
it that is flexible, fast, clear, and standard (i.e. dict comprehensions).
> I agree with your intuition that low level is better. `total` would be
> useful. If you have total and mul, then as you and others have pointed out,
> normalize is just c *= 1/c.total.
I fully support adding some functionality for scaling to support probability
distributions, bayesian update steps, chi-square tests, and whatnot. The
people who need convincing are the other respondents on the tracker. They had
a strong mental model for the Counter class that is somewhat at odds with this
proposal.
Raymond
_______________________________________________
Python-ideas mailing list
[email protected]
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/