Steven D'Aprano wrote:
per wrote:
currently, this is very slow in python, even if all i do is break up
each line using split()
******************
and store its values in a dictionary,
******************
indexing by one of the tab separated values in the file.
If that's the problem, the solution is: get more memory.
Steven caught the "and store its values in a dictionary" part
(which I missed previously, and have accentuated in the quote
above). Two factors you omitted:
1) how many *lines* are in this file (or, equivalently, what
the average line length is). You can use the following code
both to find out how many lines the file contains and to see
how long it takes Python to skim through an 800-meg file in
terms of raw file I/O alone:
i = 0
for line in file('in.txt'):
    i += 1
print "%i lines" % i
2) how much overlap/commonality is there in the keys between
lines? Does every line create a new key, in which case you're
adding $LINES keys to your dictionary? Or does some percentage
of lines merely overwrite existing entries with new values?
After one of your slow runs, issue a
print len(my_dict)
to see how many keys are in the final dict.
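If you'd rather not sit through another slow run just to get
that number, a cheap way to gauge the overlap is to collect only
the keys into a set while skimming the file -- a rough sketch
(Python 2, assuming the key is the first tab-separated field;
adjust the index to match your data):

keys = set()
lines = 0
for line in file('in.txt'):
    lines += 1
    keys.add(line.split('\t')[0])    # just the key, not the record
print "%i lines, %i distinct keys" % (lines, len(keys))

If len(keys) is close to the line count, nearly every line adds
a new entry; if it's much smaller, you're mostly overwriting.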
If you end up with millions of keys in your dict, you may be
able to use the "bsddb" module (or the higher-level "anydbm" /
"shelve" wrappers) to store the dict on disk and save memory.
Accessing *two* files may not buy you much raw speed, but at
least you won't be thrashing virtual memory with a huge
in-memory dict, so the rest of your app won't bog down from the
swapping. This has the added advantage that, if your input file
doesn't change, you can simply reuse the on-disk database/dict
file without rebuilding its contents.
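Something along these lines might work -- a rough sketch
(Python 2, using bsddb.hashopen; the 'in.txt.db' cache filename
and the key-in-the-first-column layout are just assumptions
about your setup):

import os
import bsddb

dbfile = 'in.txt.db'
if not os.path.exists(dbfile):
    db = bsddb.hashopen(dbfile, 'n')   # build the on-disk dict once
    for line in file('in.txt'):
        key = line.split('\t')[0]      # adjust to whichever column you index by
        db[key] = line.rstrip('\n')    # keys and values must be strings
    db.close()

db = bsddb.hashopen(dbfile, 'r')       # later runs just reopen the existing file
# ...use db[some_key] much like an ordinary dict of strings...
db.close()

If you need non-string values, wrap the same idea in the shelve
module, which pickles the values for you.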
-tkc