removing duplication from a huge list.

Shanmuga Rajan Thu, 26 Feb 2009 20:32:57 -0800

Hi


I have a list of records with some details (more than 15 million records),
and it contains duplicates.

I need to iterate through every record and eliminate the duplicate
records.

Currently I am using a script like this:


counted_recs = []
# some_fun() returns a generator, which is the source of the records.
# I don't want to hold all 15 million records in a list
# (that would need more than 1 GB of memory).
x = some_fun()

for rec in x:
    if rec[0] not in counted_recs:
        # some logic goes here...
        # I only need rec[0] (the name) from each record.
        counted_recs.append(rec[0])


But I am sure this is not an optimized way to do it, so I came up with a
different solution. I am not confident in that solution either.

Here is my second solution:

counted_recs = []

x = some_fun()

#x = [ rec[0] for rec in x]

for rec in x:
    # count() == 0 means this name has not been seen yet
    if counted_recs.count(rec[0]) == 0:
        # some logic goes here
        counted_recs.append(rec[0])


Which one is better? If anyone can suggest a better solution, I will be
very happy.
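
Both versions above search a list for every record, and that search gets
slower as the list grows. One alternative I could try is a set, where the
membership test does not scan every stored name. The following is only a
minimal sketch, not a tested solution for the real data: it assumes rec[0]
is hashable (for example a string), the some_fun() shown here is a
hypothetical stand-in for my real generator, and unique_records() is a name
I made up for illustration.

```python
def some_fun():
    # Hypothetical stand-in for the real record source (a generator).
    yield ("alice", 1)
    yield ("bob", 2)
    yield ("alice", 3)


def unique_records(records):
    """Yield only the first record seen for each name (rec[0])."""
    seen = set()  # holds names only, not whole records
    for rec in records:
        name = rec[0]
        if name not in seen:
            seen.add(name)
            # some logic goes here...
            yield rec


result = list(unique_records(some_fun()))
```

The set still has to hold every distinct name, so memory use depends on how
many unique names there are, not on the total number of records.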

Thanks in advance for any help.


Shan

--
http://mail.python.org/mailman/listinfo/python-list
