Steven D'Aprano wrote:
per wrote:
currently, this is very slow in python, even if all i do is break up
each line using split()
******************
and store its values in a dictionary,
******************
indexing by one of the tab separated values in the file.
If that's the problem, the solution is: get more memory.
Steven caught the "and store its values in a dictionary" part
(which I missed previously, and have accentuated in the quote
above). Two factors you omitted:
1) how many *lines* are in this file (or, equivalently, what
the average line length is). You can use the following code
both to find out how many lines the file contains and to see
how long it takes Python to skim through an 800-meg file in
terms of raw file I/O alone:
i = 0
for line in file('in.txt'):
    i += 1
print "%i lines" % i
2) how much overlap/commonality is there in the keys between
lines? Does every line create a new key, in which case you're
adding $LINES keys to your dictionary? Or does some percentage
of lines merely overwrite existing entries with new values?
After one of your slow runs, issue a
print len(my_dict)
to see how many keys are in the final dict.
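If you'd rather not sit through another slow run just to get
that number, a cheap way to gauge the overlap is to collect only
the keys into a set while skimming the file -- a rough sketch
(Python 2, assuming the key is the first tab-separated field;
adjust the index to match your data):

keys = set()
lines = 0
for line in file('in.txt'):
    lines += 1
    keys.add(line.split('\t')[0])    # just the key, not the record
print "%i lines, %i distinct keys" % (lines, len(keys))

If len(keys) is close to the line count, nearly every line adds
a new entry; if it's much smaller, you're mostly overwriting.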
If you end up with millions of keys in your dict, you may be
able to use the "bsddb" module (or the higher-level "anydbm" /
"shelve" wrappers) to store the dict on disk and save memory.
Accessing *two* files may not buy you much raw speed, but at
least you won't be thrashing virtual memory with a huge
in-memory dict, so the rest of your app won't bog down from the
swapping. This has the added advantage that, if your input file
doesn't change, you can simply reuse the on-disk database/dict
file without rebuilding its contents.
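Something along these lines might work -- a rough sketch
(Python 2, using bsddb.hashopen; the 'in.txt.db' cache filename
and the key-in-the-first-column layout are just assumptions
about your setup):

import os
import bsddb

dbfile = 'in.txt.db'
if not os.path.exists(dbfile):
    db = bsddb.hashopen(dbfile, 'n')   # build the on-disk dict once
    for line in file('in.txt'):
        key = line.split('\t')[0]      # adjust to whichever column you index by
        db[key] = line.rstrip('\n')    # keys and values must be strings
    db.close()

db = bsddb.hashopen(dbfile, 'r')       # later runs just reopen the existing file
# ...use db[some_key] much like an ordinary dict of strings...
db.close()

If you need non-string values, wrap the same idea in the shelve
module, which pickles the values for you.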
-tkc