"p." <[EMAIL PROTECTED]> writes: > So as an exercise, lets assume 800MB file, each line of data taking up > roughly 150B (guesstimate - based on examination of sample data)...so > roughly 5.3 million unique IDs.
I still don't understand what the problem is. Are you familiar with the concept of external sorting? What OS are you using? If you're using a Un*x-like system, the built-in sort command should do what you need.

"Internal" sorting means reading a file into memory and sorting it there with something like the list.sort() method. External sorting is what you do when the file won't fit in memory: you read sequential chunks of the file, where each chunk fits in memory, sort each chunk internally and write it to a temporary disk file, then merge all the disk files. You can sort inputs of essentially unlimited size this way. The unix sort command knows how to do this.

It's often a good exercise with this type of problem to ask yourself how an old-time mainframe programmer would have done it. A "big" computer of the 1960s might have had 128 kbytes of memory and a few MB of disk, but a bunch of magtape drives that held a few dozen MB each. With computers like that, they managed to process the phone bills for millions of people. The methods they used are still relevant on today's much bigger and faster computers. If you watch old movies that tried to get a high-tech look by showing machine rooms full of pulsating tape drives, external sorting is what those computers spent most of their time doing.

Finally, 800MB isn't all that big a file by today's standards. Memory for desktop computers costs around 25 dollars per gigabyte, so having 8GB of RAM on your desk to crunch those 800MB files with is not at all unreasonable.
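Here is a minimal sketch of the chunk-and-merge approach described above. The file names, the chunk size, and the assumption of newline-terminated text records are illustrative, not from the original problem; heapq.merge is used only because it does the k-way merge lazily, so a hand-rolled merge loop would work just as well.

    # External merge sort sketch (assumed names: input.txt, sorted.txt).
    import heapq
    import itertools
    import tempfile

    CHUNK_LINES = 1000000   # pick so one chunk fits comfortably in memory

    def external_sort(in_path, out_path):
        runs = []
        with open(in_path) as infile:
            while True:
                # Read one memory-sized chunk and sort it internally.
                chunk = list(itertools.islice(infile, CHUNK_LINES))
                if not chunk:
                    break
                chunk.sort()
                run = tempfile.TemporaryFile(mode='w+')
                run.writelines(chunk)
                run.seek(0)
                runs.append(run)
        # Merge the sorted runs; heapq.merge streams lazily, so memory
        # use is bounded by the number of runs, not the file size.
        with open(out_path, 'w') as outfile:
            outfile.writelines(heapq.merge(*runs))
        for run in runs:
            run.close()

    external_sort('input.txt', 'sorted.txt')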
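And if a Un*x-like sort is available, you can just hand the whole job to it; again, the file names here are only placeholders.

    # Let the system sort command do the external sort for us.
    # Note: sort's ordering depends on the locale; run with LC_ALL=C
    # if you want plain byte-by-byte comparison.
    import subprocess
    subprocess.check_call(['sort', '-o', 'sorted.txt', 'input.txt'])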