On Aug 5, 2008, at 10:00 PM, [EMAIL PROTECTED] wrote:
I have a csv file containing product information that is 700+ MB in size. I'm trying to go through and pull out unique product IDs only, as there are a lot of duplicates. My problem is that I am appending the ProductID to an array and then searching through that array each time to see if I've seen the product ID before. So each search takes longer and longer. I let the script run for 2 hours before killing it, and it had only run through less than 1/10 of the file.
Why not split the file into more manageable chunks, especially as it seems to be just plain text?
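For example, something like this would carve it into pieces you could work through one at a time (a rough sketch only; the chunk size and the .partN naming are my own assumptions, not from your post):

def split_csv(path, lines_per_chunk=1000000):
    # Write successive blocks of lines_per_chunk lines to path.part0, path.part1, ...
    out = None
    chunk_index = 0
    src = open(path, "r")
    for line_number, line in enumerate(src):
        if line_number % lines_per_chunk == 0:
            if out:
                out.close()
            out = open("%s.part%d" % (path, chunk_index), "w")
            chunk_index += 1
        out.write(line)
    if out:
        out.close()
    src.close()

split_csv("c:\\input.txt")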
Here's the code:

import string

def checkForProduct(product_id, product_list):
    for product in product_list:
        if product == product_id:
            return 1
    return 0

input_file = "c:\\input.txt"
output_file = "c:\\output.txt"
product_info = []
input_count = 0

input = open(input_file, "r")
output = open(output_file, "w")

for line in input:
    break_down = line.split(",")
    product_number = break_down[2]
    input_count += 1
    if input_count == 1:
        product_info.append(product_number)
        output.write(line)
        output_count = 1
This seems redundant.
    if not checkForProduct(product_number, product_info):
        product_info.append(product_number)
        output.write(line)
        output_count += 1
File writing is extremely expensive, and so is reading. Think about reading the file in whole chunks, putting those chunks into Python data structures, and building your output in Python data structures as well. If you use a dictionary and look the IDs up there, you'll notice a real speed improvement: Python does a dictionary lookup far quicker than it can search a list. Then output your data all at once at the end.
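Something along these lines is what I mean (just a rough sketch: I've kept the file paths and the assumption that the product ID is the third comma-separated field from your code, and used a set rather than a dict since only membership matters; a dict used the same way works too):

input_file = "c:\\input.txt"
output_file = "c:\\output.txt"

seen = set()        # hash-based membership test, O(1) per lookup
kept_lines = []     # accumulate output in memory, write it once at the end

src = open(input_file, "r")
for line in src:
    product_number = line.split(",")[2]
    if product_number not in seen:
        seen.add(product_number)
        kept_lines.append(line)
src.close()

dst = open(output_file, "w")
dst.writelines(kept_lines)
dst.close()

If holding all of the kept lines in memory turns out to be too much, keep the set for the lookups and write each new line as you find it; the set is where most of the speedup comes from.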
--
Avi
--
http://mail.python.org/mailman/listinfo/python-list