On May 3, 1:42 am, Steve Howell <showel...@yahoo.com> wrote:
> On May 2, 11:48 pm, Paul Rubin <no.em...@nospam.invalid> wrote:
>
> > Paul Rubin <no.em...@nospam.invalid> writes:
> > >looking at the spec more closely, there are 256 hash tables.. ...
>
> > You know, there is a much simpler way to do this, if you can afford to
> > use a few hundred MB of memory and you don't mind some load time when
> > the program first starts.  Just dump all the data sequentially into a
> > file.  Then scan through the file, building up a Python dictionary
> > mapping data keys to byte offsets in the file (this is a few hundred MB
> > if you have 3M keys).  Then dump the dictionary as a Python pickle and
> > read it back in when you start the program.
>
> > You may want to turn off the cyclic garbage collector when building or
> > loading the dictionary, as it can badly slow down the construction of
> > big lists and maybe dicts (I'm not sure of the latter).
>
> I'm starting to lean toward the file-offset/seek approach.  I am
> writing some benchmarks on it, comparing it to a more file-system
> based approach like I mentioned in my original post.  I'll report back
> when I get results, but it's already way past my bedtime for tonight.
>
> Thanks for all your help and suggestions.
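
For reference, the garbage-collector tip quoted above could look roughly like this. It is only a sketch: gc_paused and build_offsets are made-up names, and the fixed record size is an assumption for illustration. Whether it helps at all depends on the Python version, so it is worth timing both ways.

```python
import gc
from contextlib import contextmanager


@contextmanager
def gc_paused():
    """Turn off the cyclic garbage collector for the duration of a block."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        # Only re-enable if the collector was on when we started.
        if was_enabled:
            gc.enable()


def build_offsets(keys, record_size):
    """Map each key to a byte offset, assuming fixed-size records (hypothetical)."""
    with gc_paused():
        return {key: i * record_size for i, key in enumerate(keys)}
```

Reference counting still frees ordinary objects while the collector is off; only cycle detection is paused.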

I ended up going with the approach that Paul suggested (except I used
JSON instead of pickle for persisting the hash).  I like it for its
simplicity and ease of troubleshooting.
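
In outline, the idea looks something like the sketch below. This is not the code from my repo, just a minimal illustration: the function names are invented, and storing [offset, length] pairs is one possible layout (an assumption, not necessarily what the test script does). Note that JSON object keys are always strings, so the keys have to be strings too.

```python
import json
import os
import tempfile


def write_values(path, records):
    """Write all values into one big file; return {key: [offset, length]}."""
    index = {}
    with open(path, "wb") as f:
        for key, value in records:
            index[key] = [f.tell(), len(value)]
            f.write(value)
    return index


def save_index(path, index):
    """Persist the key -> [offset, length] mapping as JSON."""
    with open(path, "w") as f:
        json.dump(index, f)


def load_index(path):
    """Read the mapping back in at program start."""
    with open(path) as f:
        return json.load(f)


def read_value(values_file, index, key):
    """Seek to the stored offset and read exactly one value."""
    offset, length = index[key]
    values_file.seek(offset)
    return values_file.read(length)


# Small demo in a temporary directory (file names chosen to match the listing).
with tempfile.TemporaryDirectory() as tmp:
    values_path = os.path.join(tmp, "values.txt")
    index_path = os.path.join(tmp, "hash.txt")
    records = [("key%d" % i, ("value %d" % i).encode("ascii")) for i in range(1000)]
    save_index(index_path, write_values(values_path, records))
    index = load_index(index_path)
    with open(values_path, "rb") as values_file:
        print(read_value(values_file, index, "key42"))
```

Opening the values file in binary mode keeps the offsets in plain bytes, so seek() lands exactly where tell() said the value started.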

My test was to write roughly 4GB of data: 2 million keys, each with a
value of about 2k bytes.

The nicest thing was how quickly I was able to write the file.
Writing tons of small files bogs down the file system, whereas the
one-big-file approach finishes in under three minutes.

Here's the code I used for testing:

https://github.com/showell/KeyValue/blob/master/test_key_value.py

Here are the results:

~/WORKSPACE/KeyValue > ls -l values.txt hash.txt
-rw-r--r--  1 steve  staff    44334161 May  3 18:53 hash.txt
-rw-r--r--  1 steve  staff  4006000000 May  3 18:53 values.txt

2000000 out of 2000000 records yielded (2k each)
Begin READING test
num trials 100000
time spent 39.8048191071
avg delay 0.000398048191071

real    2m46.887s
user    1m35.232s
sys     0m19.723s
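
As a sanity check on those numbers, here is a bit of arithmetic using only the figures printed above (the variable names are just for illustration):

```python
# Figures copied from the listing and the test output above.
values_bytes = 4006000000   # size of values.txt
index_bytes = 44334161      # size of hash.txt
num_records = 2000000
num_trials = 100000
time_spent = 39.8048191071  # seconds for the read test

bytes_per_record = values_bytes // num_records   # bytes stored per value
index_bytes_per_key = index_bytes / num_records  # JSON overhead per key
avg_delay = time_spent / num_trials              # seconds per read
reads_per_second = num_trials / time_spent

print(bytes_per_record, index_bytes_per_key, avg_delay, reads_per_second)
```

So each record takes 2003 bytes in values.txt, the JSON index costs a little over 22 bytes per key on disk, and the read test works out to roughly 2,500 reads per second.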
-- 
http://mail.python.org/mailman/listinfo/python-list
