On May 3, 1:42 am, Steve Howell <showel...@yahoo.com> wrote:
> On May 2, 11:48 pm, Paul Rubin <no.em...@nospam.invalid> wrote:
>
> > Paul Rubin <no.em...@nospam.invalid> writes:
> > >looking at the spec more closely, there are 256 hash tables.. ...
> >
> > You know, there is a much simpler way to do this, if you can afford to
> > use a few hundred MB of memory and you don't mind some load time when
> > the program first starts. Just dump all the data sequentially into a
> > file. Then scan through the file, building up a Python dictionary
> > mapping data keys to byte offsets in the file (this is a few hundred MB
> > if you have 3M keys). Then dump the dictionary as a Python pickle and
> > read it back in when you start the program.
> >
> > You may want to turn off the cyclic garbage collector when building or
> > loading the dictionary, as it badly can slow down the construction of
> > big lists and maybe dicts (I'm not sure of the latter).
>
> I'm starting to lean toward the file-offset/seek approach. I am
> writing some benchmarks on it, comparing it to a more file-system
> based approach like I mentioned in my original post. I'll report back
> when I get results, but it's already way past my bedtime for tonight.
>
> Thanks for all your help and suggestions.
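
For anyone following along, here is a minimal sketch of the file-offset scheme described in the quoted text above. It is not the code from this thread: the function names (build_store, load_index, read_value) and the file layout are hypothetical, and storing a (offset, length) pair per key rather than a bare offset is an assumption made so that each value can be read back without a delimiter.

```python
import gc
import pickle


def build_store(records, values_path, index_path):
    """Write all values back to back into one file, then pickle a
    key -> (offset, length) dictionary describing where each one lives."""
    index = {}
    gc.disable()  # per the quoted advice: skip cyclic GC while the big dict grows
    try:
        with open(values_path, "wb") as values:
            for key, value in records:
                # tell() gives the byte offset at which this value starts
                index[key] = (values.tell(), len(value))
                values.write(value)
    finally:
        gc.enable()
    with open(index_path, "wb") as f:
        pickle.dump(index, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_index(index_path):
    """Read the pickled index back in at program start."""
    gc.disable()
    try:
        with open(index_path, "rb") as f:
            return pickle.load(f)
    finally:
        gc.enable()


def read_value(values_file, index, key):
    """Seek to the stored offset and read exactly one value."""
    offset, length = index[key]
    values_file.seek(offset)
    return values_file.read(length)
```

Lookups then cost one dictionary access plus one seek and read on an already-open file; the price is the load time and memory for the index at startup, as the quoted message notes.
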

I ended up going with the approach that Paul suggested (except I used
JSON instead of pickle for persisting the hash). I like it for its
simplicity and ease of troubleshooting.

My test was to write roughly 4GB of data, with 2 million keys of 2k
bytes each.

The nicest thing was how quickly I was able to write the file.
Writing tons of small files bogs down the file system, whereas the
one-big-file approach finishes in under three minutes.

Here's the code I used for testing:

https://github.com/showell/KeyValue/blob/master/test_key_value.py

Here are the results:

  ~/WORKSPACE/KeyValue > ls -l values.txt hash.txt
  -rw-r--r--  1 steve  staff    44334161 May  3 18:53 hash.txt
  -rw-r--r--  1 steve  staff  4006000000 May  3 18:53 values.txt

  2000000 out of 2000000 records yielded (2k each)
  Begin READING test
  num trials 100000
  time spent 39.8048191071
  avg delay 0.000398048191071

  real    2m46.887s
  user    1m35.232s
  sys     0m19.723s

--
http://mail.python.org/mailman/listinfo/python-list