[EMAIL PROTECTED] wrote:
> I wrote this function which does the following:
> after reading lines from a file, it splits and finds the word occurrences
> through a hash table... for some reason this is quite slow. Can someone
> help me make it faster?
>
> f = open(filename)
> lines = f.readlines()
> def create_words(lines):
>     cnt = 0
>     spl_set = '[",;<>{}_&?!():-[\.=+*\t\n\r]+'
>     for content in lines:
>         words=content.split()
>         countDict={}
>         wordlist = []
>         for w in words:
>             w=string.lower(w)
>             if w[-1] in spl_set: w = w[:-1]
>             if w != '':
>                 if countDict.has_key(w):
>                     countDict[w]=countDict[w]+1
>                 else:
>                     countDict[w]=1
>         wordlist = countDict.keys()
>         wordlist.sort()
>         cnt += 1
>         if countDict != {}:
>             for word in wordlist: print (word+' '+
>                 str(countDict[word])+'\n')
>

The way this is written, you create a new countDict object for every line
of the file; it's not clear that this is what you meant to do.
Also, you are sorting wordlist for every line rather than once for the
entire file, because the sort is inside the loop that processes lines.

There is some extra work in testing for an empty dictionary:

wordlist=countDict.keys()

then

if countDict != {}:
    for word in wordlist:

If countDict is empty then wordlist will be empty too, so testing for it
is unnecessary.

You are incrementing cnt but never using it.

I don't think spl_set will do what you want, but I haven't modified it.
As written, the test "w[-1] in spl_set" only checks whether the last
character of the word is one of the characters in that string. To split
on all those characters you are going to need to use regular expressions,
not split.

Modified code:

def create_words(lines):
    spl_set = '[",;<>{}_&?!():-[\.=+*\t\n\r]+'
    # One dictionary for the whole file, not one per line
    countDict={}
    for content in lines:
        words=content.split()
        for w in words:
            w=w.lower()
            # Strip a single trailing punctuation character, if any
            if w[-1] in spl_set: w = w[:-1]
            if w:
                if countDict.has_key(w):
                    countDict[w]=countDict[w]+1
                else:
                    countDict[w]=1

    return countDict

import time
filename=r'C:\cygwin\usr\share\vim\vim63\doc\version5.txt'
f = open(filename)
lines = f.readlines()
start_time=time.time()
countDict=create_words(lines)
stop_time=time.time()
elapsed_time=stop_time-start_time
# Sort and print once, after the whole file has been counted
wordlist = countDict.keys()
wordlist.sort()
for word in wordlist:
    print "word=%s count=%i" % (word, countDict[word])

print "Elapsed time in create_words function=%.2f seconds" % elapsed_time

I ran this against a 551K text file and it runs in 0.11 seconds on my
machine (3.0 GHz P4).

Larry Bates
--
http://mail.python.org/mailman/listinfo/python-list
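
The reply above suggests regular expressions for splitting on all of the punctuation characters, but does not show it. Here is a minimal sketch of what that might look like. It assumes Python 3 (the code in the thread is Python 2), and the names SEPARATORS and count_words are made up for illustration. The character class is only an approximation of the characters listed in spl_set, plus whitespace.

```python
import re
from collections import Counter

# Assumption: roughly the punctuation listed in the original spl_set,
# plus any whitespace, all treated as word separators.
SEPARATORS = re.compile(r'[\s",;<>{}_&?!():\-\[\].=+*]+')

def count_words(lines):
    """Return a Counter mapping each lowercased word to its frequency."""
    counts = Counter()
    for line in lines:
        # re.split can yield empty strings at the ends of a line; skip them.
        counts.update(w for w in SEPARATORS.split(line.lower()) if w)
    return counts

if __name__ == "__main__":
    sample = ["Hello, world!", "hello (again); world."]
    # Sort once, after everything has been counted.
    for word, n in sorted(count_words(sample).items()):
        print("word=%s count=%i" % (word, n))
```

Unlike the "w[-1] in spl_set" test, this removes separators wherever they occur in a line, not just a single trailing character, so the resulting counts can differ from those of the modified code above.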