Maybe this should be implemented in C. But I believe the algorithm
itself must be wrong, regardless of the language. I really think I'm
doing something wrong: the processing time does not appear to be
linear in the number of rows, and not even n*log(n). There should be
a more efficient way to do this. But how?
I had the idea of sorting the rows by a given dimension. Then it
would be straightforward to speed up the indexing for that dimension.
Probably sorting all rows would be faster than calling list.index()
for each row. But... it does not seem very efficient either. What if
I have 1 million rows and 10 dimensions? Do I sort 1 million rows on
the disk 10 times? Some of you may have run into the same problem
before and can tell me the most efficient way to do this.
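For reference, the indexing step is roughly like this (a simplified
sketch; the real variable and function names differ):

    # Simplified sketch of the current slow version: for each row,
    # look up every column value with list.index(), which scans the
    # dimension's value list linearly on every call.
    def index_rows_slow(rows, dimension_values):
        indexed = []
        for row in rows:
            indexed.append([dimension_values[col_idx].index(value)
                            for col_idx, value in enumerate(row)])
        return indexed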
The .index method does a linear search, checking on average 1/2 of the
items in the list. That's why it's so slow.
To avoid that, you could build a dict mapping each value in
dimension_values[col_idx] to its index in a single pass, so that each
lookup becomes a quick constant-time operation.
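Something like this sketch (assuming rows and dimension_values are
shaped like in the code above; adapt the names to your real code):

    # Build one value -> index dict per dimension in a single pass,
    # then replace the linear list.index() calls with O(1) lookups.
    def index_rows_fast(rows, dimension_values):
        lookups = [{value: idx for idx, value in enumerate(values)}
                   for values in dimension_values]
        return [[lookups[col_idx][value]
                 for col_idx, value in enumerate(row)]
                for row in rows]

Building the dicts costs one pass per dimension; after that, every
row is processed with constant-time lookups instead of linear scans.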
Changed to dicts and hashed lookups. Now the processing time is O(n),
and throughput went up to 8000 rows/sec.
Probably I'll never want to process more than 1M rows. That will take
about 125 seconds. Fair enough.
Thank you very much!
Laszlo