r.e.s. wrote:
> I have a million-line text file with 100 characters per line,
> and simply need to determine how many of the lines are distinct.
>
> On my PC, this little program just goes to never-never land:
>
> def number_distinct(fn):
>     f = file(fn)
>     x = f.readline().strip()
>     L = []
>     while x <> '':
>         if x not in L:
>             L = L + [x]
>         x = f.readline().strip()
>     return len(L)
>
> Would anyone care to point out improvements?
> Is there a better algorithm for doing this?
Have you tried cat file | sort | uniq | wc -l ? sort might choke on a file that large, and this isn't Python, but it might work. You might also try breaking the file into smaller pieces, maybe based on the first character of each line, and then processing them separately. The time killer is probably the "x not in L" test, since L gets very large and each test rescans the whole list. Subdividing the problem up front keeps each piece's list small, so that cost stays manageable.
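Back in Python, the standard fix for the "x not in L" cost is a set, whose membership test is O(1) on average instead of a scan of the whole list. A minimal sketch, assuming Python 2.4 or later for the built-in set type:

    def number_distinct(fn):
        seen = set()                # set membership tests are O(1) on average
        for line in open(fn):      # iterate over the file a line at a time
            seen.add(line.strip()) # adding a duplicate is a no-op
        return len(seen)

Note this still holds every distinct line in memory at once; with a million 100-character lines that's on the order of 100 MB in the worst case, which is where the subdividing idea comes in.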
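If one big set is too much memory, here's a rough sketch of the two-pass subdividing approach: scatter lines into temporary bucket files keyed by first character, then count each bucket with its own small set. The bucket filenames and scheme are invented for illustration:

    def number_distinct_partitioned(fn):
        # Pass 1: scatter lines into bucket files keyed by first character.
        buckets = {}
        for line in open(fn):
            line = line.strip()
            if not line:
                continue
            key = line[0]
            if key not in buckets:
                buckets[key] = open('bucket_%d.tmp' % ord(key), 'w')
            buckets[key].write(line + '\n')
        for f in buckets.values():
            f.close()
        # Pass 2: count distinct lines per bucket; each set stays small.
        total = 0
        for key in buckets:
            seen = set()
            for line in open('bucket_%d.tmp' % ord(key)):
                seen.add(line.rstrip('\n'))
            total += len(seen)
        return total

Since every bucket only ever sees lines sharing a first character, no single set has to hold all million lines at once.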