On Wed, Aug 27, 2014 at 1:53 PM, Rodrick Brown <rodrick.br...@gmail.com> wrote:
> def wc1():
>     word=""
>     m={}
>     for c in s:
>         if c != " ":
>             word += c
>         else:
>             if m.has_key(word):
>                 m[word] += 1
>             else:
>                 m[word] = 1
>             word=""
>     return(m)
>
> def wc2():
>     m={}
>     for c in s.split():
>         if m.has_key(c):
>             m[c] += 1
>         else:
>             m[c] = 1
>     return(m)

On Wed, Aug 27, 2014 at 2:21 PM, Tim Chase <python.l...@tim.thechases.com> wrote:

> The thing that surprises me is that using collections.Counter() and
> collections.defaultdict(int) are also slower than your wc2():
>
> from collections import defaultdict, Counter
>
> def wc3():
>     return Counter(s.split())
>
> def wc4():
>     m = defaultdict(int)
>     for c in s.split():
>         m[c] += 1
>     return m

I ran a couple more experiments, and, at least on my machine, it has to
do with the number of duplicate words found. I also added two more
variations of my own:

def wc5(s):
    # I expect this one to be slow.
    m = {}
    for c in s.split():
        m.setdefault(c, 0)
        m[c] += 1
    return m

def wc6(s):
    # This one might be better than any other option presented so far.
    m = {}
    for c in s.split():
        try:
            m[c] += 1
        except KeyError:
            m[c] = 1
    return m

I also switched the OP's versions to use the "in" operator rather than
has_key, as I am running Python 3.4.1.
With the same dataset the OP provided (plus a trailing space), here are
my times:

s = "The black cat jump over the bigger black cat "

>>> timeit.timeit("wc1(s)", setup=setup, number=1000000)
6.076951338314008
*>>> timeit.timeit("wc2(s)", setup=setup, number=1000000)*
*2.451220378346954*
>>> timeit.timeit("wc3(s)", setup=setup, number=1000000)
5.249674617410577
>>> timeit.timeit("wc4(s)", setup=setup, number=1000000)
3.531042215121076
>>> timeit.timeit("wc5(s)", setup=setup, number=1000000)
3.4734603842861205
>>> timeit.timeit("wc6(s)", setup=setup, number=1000000)
4.322543365103378

When I increase the data set by multiplying the OP's string 1000 times
(s = "The black cat jump over the bigger black cat " * 1000), here are
the times (I reduced the number of repetitions to keep the total time
reasonable):

>>> timeit.timeit("wc1(s)", setup=setup, number=1000)
5.807871555058417
>>> timeit.timeit("wc2(s)", setup=setup, number=1000)
2.3245083748933535
*>>> timeit.timeit("wc3(s)", setup=setup, number=1000)*
*1.5722138905703211*
>>> timeit.timeit("wc4(s)", setup=setup, number=1000)
1.901478857657942
>>> timeit.timeit("wc5(s)", setup=setup, number=1000)
3.065888476414475
>>> timeit.timeit("wc6(s)", setup=setup, number=1000)
2.0125233934956217

It seems that with a large number of duplicate words, the Counter
version is the best; with fewer duplicates, however, the "in"
membership check (wc2) is better.

Chris
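P.S. As a sanity check on the Counter variant (my addition, not part of the
timings above), it produces the same counts as the hand-rolled dicts on the
enlarged dataset, and most_common() gives a ranked view for free:

```python
from collections import Counter

# Same enlarged dataset as the second timing run above.
s = "The black cat jump over the bigger black cat " * 1000
counts = Counter(s.split())

# "black" and "cat" each appear twice per repetition.
assert counts["black"] == 2000
assert counts["cat"] == 2000

# most_common(2) returns the two highest-count (word, count) pairs.
print(counts.most_common(2))
```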
-- https://mail.python.org/mailman/listinfo/python-list