Tim Rowe <digi...@gmail.com> writes:
> We were told in the original question: more than 15 million records,
> and it won't all fit into memory. So your observation is pertinent.

That is not terribly many records by today's standards. The knee-jerk approach is to sort them externally, then make a linear pass skipping the duplicates. Is the exercise to write an external sort in Python? It's worth doing if you've never done it before.

--
http://mail.python.org/mailman/listinfo/python-list
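
For what it's worth, here is a rough sketch of the sort-then-skip approach, not a tuned implementation. It assumes each record is one line of text and that a chunk of chunk_size lines fits comfortably in memory. The names unique_records, _write_run and chunk_size are made up for illustration; nothing in the thread specifies the record format. Sorted runs are spilled to temporary files, merged with heapq.merge, and a single pass over the merged stream drops adjacent duplicates.

```python
import heapq
import itertools
import os
import tempfile


def _write_run(sorted_lines, tmp_dir):
    """Write one sorted chunk to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(dir=tmp_dir, suffix=".run", text=True)
    with os.fdopen(fd, "w") as f:
        f.writelines(sorted_lines)
    return path


def unique_records(lines, chunk_size=1_000_000, tmp_dir=None):
    """Yield each distinct line once, in sorted order.

    lines: any iterable of strings, e.g. an open text file.
    chunk_size: hypothetical tuning knob, the number of lines sorted
    in memory per run. Pick it to suit the memory you have.
    """
    run_paths = []
    try:
        # Phase 1: cut the input into chunks that fit in memory,
        # sort each chunk, and spill it to disk as a sorted run.
        it = iter(lines)
        while True:
            # Normalise line endings so a final unterminated line
            # compares equal to its terminated twin and is not glued
            # to its neighbour when written out.
            chunk = [
                line if line.endswith("\n") else line + "\n"
                for line in itertools.islice(it, chunk_size)
            ]
            if not chunk:
                break
            chunk.sort()
            run_paths.append(_write_run(chunk, tmp_dir))

        # Phase 2: k-way merge of the sorted runs, then one linear
        # pass that skips any line equal to the one before it.
        files = [open(path) for path in run_paths]
        try:
            previous = None
            for line in heapq.merge(*files):
                if line != previous:
                    yield line
                    previous = line
        finally:
            for f in files:
                f.close()
    finally:
        # Remove the temporary runs, even if the caller stops early.
        for path in run_paths:
            os.remove(path)
```

Typical use would be something like writing every line yielded by unique_records(open("records.txt")) to an output file, where records.txt is a placeholder name. With 15 million records and the default chunk_size that is about 15 runs, so all of them can be open at once during the merge. A much larger input would need a multi-pass merge to stay under the open-file limit.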