On Feb 27, 8:33 am, Tim Rowe <digi...@gmail.com> wrote:
> 2009/2/27 odeits <ode...@gmail.com>:
>
> > How big of a list are we talking about? If the list is so big that the
> > entire list cannot fit in memory at the same time, this approach won't
> > work, e.g. removing duplicate lines from a very large file.
>
> We were told in the original question: more than 15 million records,
> and it won't all fit into memory. So your observation is pertinent.
>
> --
> Tim Rowe
If order did matter, and the list itself couldn't be stored in memory, I would personally compute some sort of hash for each item (or something as simple as the first 5 bytes, the last 5 bytes, and the length), keep a reference to the item each hash belongs to, sort the hashes, identify duplicates among them, and then use the references to check whether the actual items match as well. Pretty brutish and slow, but it's the first algorithm that comes to mind. Of course, I'm assuming the list items are long enough to warrant hashing them rather than comparing the values themselves.

~G
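
P.S. Here's a rough, untested sketch of that idea in Python, assuming the "items" are lines in a file and that the fingerprints (one small hash plus a byte offset per line) do fit in memory even though the lines themselves don't. md5 is just a stand-in for whatever cheap fingerprint you prefer; first/last 5 bytes plus the length would slot in the same way.

import hashlib

def find_duplicate_lines(path):
    # Pass 1: fingerprint every line, remembering the byte offset where
    # the line starts.  The fingerprints are small, so ~15 million of them
    # can fit in memory even when the lines themselves don't.
    fingerprints = []
    with open(path, 'rb') as f:
        offset = 0
        for line in f:
            fingerprints.append((hashlib.md5(line).digest(), offset))
            offset += len(line)

    # Sort so identical fingerprints end up next to each other.
    fingerprints.sort()

    # Pass 2: for each run of equal fingerprints, seek back into the file
    # and compare the actual lines, so hash collisions can't produce
    # false positives.
    duplicates = []
    with open(path, 'rb') as f:
        i = 0
        while i < len(fingerprints):
            j = i
            while j < len(fingerprints) and fingerprints[j][0] == fingerprints[i][0]:
                j += 1
            if j - i > 1:
                seen = {}
                for _digest, off in fingerprints[i:j]:
                    f.seek(off)
                    line = f.readline()
                    if line in seen:
                        # (offset of first occurrence, offset of duplicate)
                        duplicates.append((seen[line], off))
                    else:
                        seen[line] = off
            i = j
    return duplicates

If even the fingerprints won't fit in memory, you'd have to spill them to disk and do an external sort instead, but the final "re-read and compare the real items" step stays the same.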