Maybe this should be implemented in C. But I believe that the algorithm itself must be wrong, regardless of the language - I really think I'm doing something wrong. It looks like my algorithm's processing time is not linear in the number of rows; not even n*log(n). There should be a more efficient way to do this. But how?

I had the idea of sorting the rows by a given dimension. Then it would be obvious to speed up the indexing part - for that dimension. PROBABLY sorting all rows would be faster than calling list.index() for each row. But... it does not seem very efficient either. What if I have 1 million rows and 10 dimensions? Do I sort 1 million rows on disk 10 times? Some of you might have run into the same problem before, and can tell me which is the most efficient way to do this.
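
(For context, the slow version presumably looks something like this. The actual code is not shown in this message, so the row layout and function name are assumptions; the dimension_values[col_idx] naming is taken from the reply below.)

    # Presumed slow version: every cell is translated to its index with
    # list.index(), which scans dimension_values[col_idx] linearly each time.
    def index_rows_slow(rows, dimension_values):
        indexed = []
        for row in rows:
            indexed.append([dimension_values[col_idx].index(value)
                            for col_idx, value in enumerate(row)])
        return indexed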

The .index method does a linear search, checking on average 1/2 of the items in the list. That's why it's so slow.

In order to avoid that, you could build a dict of each value in dimension_values[col_idx] and its index in a single pass, so that it becomes a quick lookup.
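
A minimal sketch of that suggestion, assuming rows are sequences of dimension values and dimension_values[col_idx] holds the distinct values for column col_idx (the function names are illustrative):

    def build_lookups(dimension_values):
        # One value -> index dict per dimension, built in a single pass each.
        return [{value: idx for idx, value in enumerate(values)}
                for values in dimension_values]

    def index_rows_fast(rows, dimension_values):
        lookups = build_lookups(dimension_values)
        # Each cell is now an O(1) hashed lookup instead of a linear scan.
        return [[lookups[col_idx][value]
                 for col_idx, value in enumerate(row)]
                for row in rows]

Building the dicts is a one-time cost proportional to the number of distinct values; after that, each row costs a constant number of hash lookups, so the whole pass scales linearly with the number of rows.
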
I changed to dicts and hashed lookups. Now the processing time is O(n), and throughput went up to 8000 rows/sec.

Probably I'll never want to process more than 1M rows. That will take about 125 seconds. Fair enough.

Thank you very much!

  Laszlo
