r.e.s. wrote:
> I have a million-line text file with 100 characters per line,
> and simply need to determine how many of the lines are distinct.
>
> On my PC, this little program just goes to never-never land:
>
> def number_distinct(fn):
>     f = file(fn)
>     x = f.readline().strip()
>     L = []
>     while x<>'':
>         if x not in L:
>             L = L + [x]
>         x = f.readline().strip()
>     return len(L)
>
> Would anyone care to point out improvements?
> Is there a better algorithm for doing this?

Sounds like homework, but I'll bite.

def number_distinct(fn):
    # Maps hash(line) -> 1. Only the hash of each line is stored, which
    # saves memory, but two different lines whose hashes collide would
    # be counted as one.
    hash_dict = {}
    total_lines = 0
    for line in open(fn, 'r'):
        total_lines += 1
        key = hash(line.strip())
        # Dictionary lookup is constant time on average, unlike
        # scanning a list.
        if key in hash_dict:
            continue
        hash_dict[key] = 1

    return total_lines, len(hash_dict)

if __name__ == "__main__":
    fn = 'c:\\test.txt'
    total_lines, distinct_lines = number_distinct(fn)
    print "Total lines=%i, distinct lines=%i" % (total_lines, distinct_lines)

-Larry Bates
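
For comparison, here is a minimal sketch of the same idea using a set of the stripped lines themselves rather than their hashes. The original loop is slow because `x not in L` scans the whole list and `L = L + [x]` copies it, so the work grows roughly with the square of the number of distinct lines; a set lookup is constant time on average. The name `count_distinct_lines` is hypothetical, and the sketch assumes a Python with built-in `set` and the `with` statement and that all distinct lines fit in memory.

```python
def count_distinct_lines(path):
    """Return (total_lines, distinct_lines) for the text file at `path`."""
    seen = set()   # holds each distinct stripped line exactly once
    total = 0
    with open(path) as f:
        for line in f:
            total += 1
            # Strip surrounding whitespace (including the newline) so the
            # comparison matches the original program's use of strip().
            seen.add(line.strip())
    return total, len(seen)
```

Storing the lines themselves avoids any miscount from hash collisions, at the cost of memory: a million distinct 100-character lines is about 100 MB of raw text plus per-string overhead. Unlike the original `while x<>''` loop, this version also does not stop at the first blank line.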