per wrote:
> hi all,
>
> i have a program that essentially loops through a text file that's
> about 800 MB in size, containing tab-separated data... my program
> parses this file and stores its fields in a dictionary of lists.
>
> for line in file:
>     split_values = line.strip().split('\t')
>     # do stuff with split_values
>
> currently, this is very slow in python, even if all i do is break up
> each line using split() and store its values in a dictionary, indexing
> by one of the tab-separated values in the file.
>
> is this just an overhead of python that's inevitable? do you guys
> think that switching to cython might speed this up, perhaps by
> optimizing the main for loop? or is this not a viable option?
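For concreteness, here is a minimal sketch of the kind of loop described above, using collections.defaultdict to index rows by one field. The filename, the choice of key column, and the exact dict-of-lists layout are assumptions for illustration, since the post doesn't show them.

    from collections import defaultdict

    def load_table(path, key_column=0):
        # key field -> list of rows that share that key (assumed layout)
        table = defaultdict(list)
        with open(path) as f:
            for line in f:
                fields = line.rstrip('\n').split('\t')
                table[fields[key_column]].append(fields)
        return table

    data = load_table('big_file.tsv')   # placeholder filename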
Any time I see large data structures, I always think of memory consumption and paging. How much memory do you have? My back-of-the-envelope estimate is that you need at least 1.2 GB to store the 800 MB of text, more if the text is Unicode or if you're on a 64-bit system. If your computer only has 1 GB of memory, it's going to be struggling; if it has 2 GB, it might be a little slow, especially if you're running other programs at the same time. If that's the problem, the solution is: get more memory.

Apart from monitoring virtual memory use, another test you could do is to see whether the time taken to build the data structures scales approximately linearly with the size of the data. That is, if it takes 2 seconds to read 80 MB of data and store it in lists, then it should take around 4 seconds to do 160 MB and 20-30 seconds to do 800 MB. If your results are linear, then there's probably not much you can do to speed it up, since the time is probably dominated by file I/O. On the other hand, if the time scales worse than linearly, there may be hope of speeding it up.

-- 
Steven
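A rough sketch of the linearity test suggested above: time the same parse-and-store loop on growing prefixes of the file and see whether the elapsed time roughly doubles when the line count doubles. The filename, key column, and line counts are placeholders, not anything from the original post.

    import time
    from itertools import islice
    from collections import defaultdict

    def build(path, max_lines):
        """Read and parse up to max_lines lines into a dict of lists."""
        table = defaultdict(list)
        with open(path) as f:
            for line in islice(f, max_lines):
                fields = line.rstrip('\n').split('\t')
                table[fields[0]].append(fields)
        return table

    for n in (100000, 200000, 400000, 800000):
        start = time.time()
        build('big_file.tsv', n)
        print("%d lines: %.1f seconds" % (n, time.time() - start))

If each doubling of n roughly doubles the time, the build is scaling linearly; if the later runs blow up much faster than that, paging or some superlinear data-structure cost is a likely suspect.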