possibilitybox wrote:
this code here:


def wordcount(lines):
    for i in range(len(lines)/8):
        words = lines[i].split(" ")
        if not locals().has_key("frequency"):
            frequency = {}
        for word in words:
            if frequency.has_key(word):
                frequency[word] += 1
            else:
                frequency[word] = 1
    return frequency

wordcount(lines)

is taking over six minutes to run on a two-megabyte text file. I
realize that's a big text file, a really big one (it's actually the
full text of Don Quixote). I'm trying to figure out why it's so slow.
Is there a better way for me to do a frequency count of all the words
in the text? It seems to me like this should scale linearly, but
perhaps it isn't? I don't know much about algorithmic complexity. If
someone could give a breakdown of this function's complexity as well,
I'd be much obliged.

lines is expected to be a list of lines as provided by file.readlines()
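
On the complexity question: dict lookups and inserts are O(1) on average, so a count like this is linear in the total number of words. Note also that range(len(lines)/8) only walks the first eighth of the lines. A rough way to see where the time actually goes, assuming the wordcount() function quoted above and a file named DonQuixote.txt, is to time the call directly:


import time

lines = open('DonQuixote.txt').readlines()   # filename borrowed from the reply below

start = time.time()
frequency = wordcount(lines)                 # the function quoted above
print len(frequency), 'distinct words in', time.time() - start, 'seconds'
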


Here is a slightly cleaner version. It takes about a second to run on my PC. What hardware are you running on?


path = 'DonQuixote.txt'

frequency = {}

for line in open(path):
    for word in line.split():
        if frequency.has_key(word):
            frequency[word] += 1
        else:
            frequency[word] = 1

print len(frequency), 'words'
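
For what it's worth, on Python 2.5 or later the same count can be written with collections.defaultdict, which drops the has_key test. This is just an equivalent sketch, not necessarily any faster:


from collections import defaultdict   # new in Python 2.5

path = 'DonQuixote.txt'

frequency = defaultdict(int)           # missing words start at 0

for line in open(path):
    for word in line.split():
        frequency[word] += 1

print len(frequency), 'words'
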


Kent
