Bill Pursell wrote:
> Have you tried
> cat file | sort | uniq | wc -l ?
The standard input file descriptor of sort can be attached directly to a
file. You don't need a file-catenating process in order to feed it:

    sort < file | uniq | wc -l

Sort has the uniq functionality built in:

    sort -u < file | wc -l

> sort might choke on the large file, and this isn't python, but it
> might work.

Solid implementations of sort can use external storage for large files,
and perform a polyphase-type sort rather than doing the entire sort in
memory. I seem to recall that GNU sort does something like this, using
temporary files. Naively written Python code is a lot more likely to
choke on a large data set.

> You might try breaking the file into smaller pieces, maybe based on
> the first character, and then process them separately.

No, the way this is done is simply to read the file and insert the data
into an ordered data structure until memory fills up. After that, you
keep reading the file and inserting, but each time you insert, you
remove the smallest element and write it out to the current segment
file. You keep doing this until it's no longer possible to extract a
smallest element which is greater than everything already written to
that file. When that happens, you start a new file. That does not
happen until you have filled memory at least twice. So for instance
with half a gig of RAM, you can produce merge segments on the order of
a gig.
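
To make the "ordered data structure" idea concrete, here is a rough
Python sketch of that replacement-selection scheme, built on the
standard heapq module. The names (lines, capacity, write_run) are
invented for the illustration; a real external sort would write each
run to a temporary file and merge the runs afterwards.

    import heapq

    def replacement_selection(lines, capacity, write_run):
        # lines: an iterator of records; capacity: how many records we
        # pretend fit in memory; write_run: receives each finished run
        # as a sorted list.  (All three are hypothetical names.)
        it = iter(lines)
        heap = [x for _, x in zip(range(capacity), it)]  # fill "memory"
        heapq.heapify(heap)
        pending = []    # records too small to join the current run
        run = []
        while heap:
            smallest = heapq.heappop(heap)
            run.append(smallest)          # write out the smallest element
            try:
                nxt = next(it)
            except StopIteration:
                pass
            else:
                if nxt >= smallest:
                    heapq.heappush(heap, nxt)   # still fits in this run
                else:
                    pending.append(nxt)         # held back for the next run
            if not heap:                  # run finished; start the next one
                write_run(run)
                run = []
                heap, pending = pending, []
                heapq.heapify(heap)

On randomly ordered input this produces runs averaging about twice the
in-memory capacity, which is where the "filled memory at least twice"
figure comes from.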