Re: key/value store optimized for disk storage

2012-05-07 Thread Steve Howell
On May 6, 10:21 pm, John Nagle wrote: > On 5/4/2012 12:14 AM, Steve Howell wrote: > > > On May 3, 11:59 pm, Paul Rubin  wrote: > >> Steve Howell  writes: > >>>      compressor = zlib.compressobj() > >>>      s = compressor.compress("foobar") > >>>      s += compressor.flush(zlib.Z_SYNC_FLUSH) > >

Re: key/value store optimized for disk storage

2012-05-06 Thread Paul Rubin
John Nagle writes: > That's awful. There's no point in compressing six characters > with zlib. Zlib has a minimum overhead of 11 bytes. You just > made the data bigger. This hack is about avoiding the initialization overhead--do you really get 11 bytes after every SYNC_FLUSH? I do remember

Re: key/value store optimized for disk storage

2012-05-06 Thread John Nagle
On 5/4/2012 12:14 AM, Steve Howell wrote: On May 3, 11:59 pm, Paul Rubin wrote: Steve Howell writes: compressor = zlib.compressobj() s = compressor.compress("foobar") s += compressor.flush(zlib.Z_SYNC_FLUSH) s_start = s compressor2 = compressor.copy() That's a

Re: key/value store optimized for disk storage

2012-05-05 Thread Jon Clements
On Friday, 4 May 2012 16:27:54 UTC+1, Steve Howell wrote: > On May 3, 6:10 pm, Miki Tebeka wrote: > > > I'm looking for a fairly lightweight key/value store that works for > > > this type of problem: > > > > I'd start with a benchmark and try some of the things that are already in > > the standa

Re: key/value store optimized for disk storage

2012-05-04 Thread Emile van Sebille
On 5/4/2012 12:49 PM Tim Chase said... On 05/04/12 14:14, Emile van Sebille wrote: On 5/4/2012 10:46 AM Tim Chase said... I hit a few snags testing this on my winxp w/python2.6.1 in that getsize wasn't finding the file as it was created in two parts with .dat and .dir extension. Hrm...must be

Re: key/value store optimized for disk storage

2012-05-04 Thread Tim Chase
On 05/04/12 14:14, Emile van Sebille wrote: > On 5/4/2012 10:46 AM Tim Chase said... > > I hit a few snags testing this on my winxp w/python2.6.1 in that getsize > wasn't finding the file as it was created in two parts with .dat and > .dir extension. Hrm...must be a Win32 vs Linux thing. > Als

Re: key/value store optimized for disk storage

2012-05-04 Thread Emile van Sebille
On 5/4/2012 10:46 AM Tim Chase said... I hit a few snags testing this on my winxp w/python2.6.1 in that getsize wasn't finding the file as it was created in two parts with .dat and .dir extension. Also, setting key failed as update returns None. The changes I needed to make are marked below.
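The getsize snag described above comes from the fallback dbm backend writing several files next to the base name rather than one file at that path. The benchmark script itself is not shown in this snippet, so the following is only a Python 3 sketch under assumptions: it uses dbm.dumb (the Python 3 name for dumbdbm), and store_size, BASE_PATH, and the sample key are hypothetical names made up for illustration.

```python
import dbm.dumb
import glob
import os
import tempfile

def store_size(base_path):
    """Total bytes used by every file the backend created for base_path."""
    pattern = glob.escape(base_path) + ".*"
    return sum(os.path.getsize(p) for p in glob.glob(pattern))

# Hypothetical demo store in a scratch directory.
BASE_PATH = os.path.join(tempfile.mkdtemp(), "kvtest")

db = dbm.dumb.open(BASE_PATH, "c")
db[b"dir1/file1.txt"] = b"some file contents"
db.close()  # closing flushes the .dir index to disk

# os.path.getsize(BASE_PATH) would fail here: only BASE_PATH.dat,
# BASE_PATH.dir (and possibly BASE_PATH.bak) exist on disk.
TOTAL_BYTES = store_size(BASE_PATH)
```

Summing over a glob keeps a size measurement independent of which backend ends up being used, as long as that backend names its files after the base path.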

Re: key/value store optimized for disk storage

2012-05-04 Thread Tim Chase
On 05/04/12 12:22, Steve Howell wrote: > Which variant do you recommend? > > """ anydbm is a generic interface to variants of the DBM database > — dbhash (requires bsddb), gdbm, or dbm. If none of these modules > is installed, the slow-but-simple implementation in module > dumbdbm will be used. >

Re: key/value store optimized for disk storage

2012-05-04 Thread Tim Chase
On 05/04/12 10:27, Steve Howell wrote: > On May 3, 6:10 pm, Miki Tebeka wrote: >>> I'm looking for a fairly lightweight key/value store that works for >>> this type of problem: >> >> I'd start with a benchmark and try some of the things that are already in >> the standard library: >> - bsddb >> -

Re: key/value store optimized for disk storage

2012-05-04 Thread Steve Howell
On May 3, 6:10 pm, Miki Tebeka wrote: > > I'm looking for a fairly lightweight key/value store that works for > > this type of problem: > > I'd start with a benchmark and try some of the things that are already in the > standard library: > - bsddb > - sqlite3 (table of key, value, index key) > -

Re: key/value store optimized for disk storage

2012-05-04 Thread Paul Rubin
Steve Howell writes: >> You should be able to just get the incremental bit. > This is fixed now. Nice. > If it's in the header, wouldn't it be part of the output that comes > before Z_SYNC_FLUSH? Hmm, maybe you are right. My version was several years ago and I don't remember it well, but I hal

Re: key/value store optimized for disk storage

2012-05-04 Thread Steve Howell
On May 4, 1:01 am, Paul Rubin wrote: > Steve Howell writes: > > Makes sense.  I believe I got that part correct: > > >  https://github.com/showell/KeyValue/blob/master/salted_compressor.py > > The API looks nice, but your compress method makes no sense.  Why do you > include s.prefix in s and the

Re: key/value store optimized for disk storage

2012-05-04 Thread Paul Rubin
Steve Howell writes: > Makes sense. I believe I got that part correct: > > https://github.com/showell/KeyValue/blob/master/salted_compressor.py The API looks nice, but your compress method makes no sense. Why do you include s.prefix in s and then strip it off? Why do you save the prefix and

Re: key/value store optimized for disk storage

2012-05-04 Thread Steve Howell
On May 3, 11:59 pm, Paul Rubin wrote: > Steve Howell writes: > >     compressor = zlib.compressobj() > >     s = compressor.compress("foobar") > >     s += compressor.flush(zlib.Z_SYNC_FLUSH) > > >     s_start = s > >     compressor2 = compressor.copy() > > I think you also want to make a decompr

Re: key/value store optimized for disk storage

2012-05-04 Thread Dan Stromberg
On Thu, May 3, 2012 at 11:03 PM, Paul Rubin wrote: > > Sort of as you suggest, you could build a Huffman encoding for a > > representative run of data, save that tree off somewhere, and then use > > it for all your future encoding/decoding. > > Zlib is better than Huffman in my experience, and Py

Re: key/value store optimized for disk storage

2012-05-04 Thread Paul Rubin
Steve Howell writes: > compressor = zlib.compressobj() > s = compressor.compress("foobar") > s += compressor.flush(zlib.Z_SYNC_FLUSH) > > s_start = s > compressor2 = compressor.copy() I think you also want to make a decompressor here, and initialize it with s and then clone it
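A minimal Python 3 sketch of the idea in this message: prime one compressor and one matching decompressor with the same text, then clone both for every record. The thread's code is Python 2 and passes str; this sketch uses bytes. PRIMING_TEXT, compress_record, and decompress_record are hypothetical names, and this is not the code from the salted_compressor.py file mentioned elsewhere in the thread.

```python
import zlib

PRIMING_TEXT = b"foobar"  # stand-in for a larger shared priming text

# Prime a compressor and flush to a byte boundary so clones start cleanly.
_primed_compressor = zlib.compressobj()
_primed_prefix = _primed_compressor.compress(PRIMING_TEXT)
_primed_prefix += _primed_compressor.flush(zlib.Z_SYNC_FLUSH)

# Prime a decompressor by feeding it the same compressed prefix once.
_primed_decompressor = zlib.decompressobj()
_primed_decompressor.decompress(_primed_prefix)

def compress_record(data):
    """Compress one record; the output holds only the incremental bytes."""
    clone = _primed_compressor.copy()
    return clone.compress(data) + clone.flush(zlib.Z_SYNC_FLUSH)

def decompress_record(blob):
    """Decompress one record produced by compress_record."""
    clone = _primed_decompressor.copy()
    return clone.decompress(blob)
```

Because each record ends with Z_SYNC_FLUSH rather than Z_FINISH, the stored bytes carry no zlib header or checksum trailer, so they can only be read back by a decompressor primed with the same prefix.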

Re: key/value store optimized for disk storage

2012-05-03 Thread Steve Howell
On May 3, 11:03 pm, Paul Rubin wrote: > Steve Howell writes: > > Sounds like a useful technique. The text snippets that I'm > > compressing are indeed mostly English words, and 7-bit ASCII, so it > > would be practical to use a compression library that just uses the > > same good-enough encoding

Re: key/value store optimized for disk storage

2012-05-03 Thread Paul Rubin
Steve Howell writes: > Sounds like a useful technique. The text snippets that I'm > compressing are indeed mostly English words, and 7-bit ASCII, so it > would be practical to use a compression library that just uses the > same good-enough encodings every time, so that you don't have to write > t

Re: key/value store optimized for disk storage

2012-05-03 Thread Steve Howell
On May 3, 9:38 pm, Paul Rubin wrote: > Steve Howell writes: > > My test was to write roughly 4GB of data, with 2 million keys of 2k > > bytes each. > > If the records are something like English text, you can compress > them with zlib and get some compression gain by pre-initializing > a zlib dict

Re: key/value store optimized for disk storage

2012-05-03 Thread Paul Rubin
Steve Howell writes: > My test was to write roughly 4GB of data, with 2 million keys of 2k > bytes each. If the records are something like English text, you can compress them with zlib and get some compression gain by pre-initializing a zlib dictionary from a fixed English corpus, then cloning it
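For readers on a newer Python, a related but different mechanism may be worth knowing: since Python 3.3 the zlib module accepts a preset dictionary through the zdict argument, which was not available when this thread was written. The sketch below uses that argument instead of the cloning approach described in the message. CORPUS is a tiny made-up stand-in for a real English corpus, and the function names are hypothetical.

```python
import zlib

# Made-up stand-in for a corpus of byte sequences common in the records.
CORPUS = b"the quick brown fox jumps over the lazy dog and other english words"

def compress_with_dict(data):
    """Compress data as a complete zlib stream that uses CORPUS as its preset dictionary."""
    compressor = zlib.compressobj(zdict=CORPUS)
    return compressor.compress(data) + compressor.flush()

def decompress_with_dict(blob):
    """Decompress a stream made by compress_with_dict (same CORPUS required)."""
    decompressor = zlib.decompressobj(zdict=CORPUS)
    return decompressor.decompress(blob) + decompressor.flush()
```

The same dictionary bytes must be available at read time, so in practice the corpus would be stored once alongside the data rather than with every record.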

Re: key/value store optimized for disk storage

2012-05-03 Thread Steve Howell
On May 3, 1:42 am, Steve Howell wrote: > On May 2, 11:48 pm, Paul Rubin wrote: > > > Paul Rubin writes: > > >looking at the spec more closely, there are 256 hash tables.. ... > > > You know, there is a much simpler way to do this, if you can afford to > > use a few hundred MB of memory and you d

Re: key/value store optimized for disk storage

2012-05-03 Thread Miki Tebeka
> I'm looking for a fairly lightweight key/value store that works for > this type of problem: I'd start with a benchmark and try some of the things that are already in the standard library: - bsddb - sqlite3 (table of key, value, index key) - shelve (though I doubt this one) You might find that f
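As a starting point for such a benchmark, here is a small Python 3 sketch of the sqlite3 option (a table of key and value, indexed on key). The table name kv and the helpers open_store, put, and get are hypothetical names chosen for illustration. Declaring the key as PRIMARY KEY is one way to get the index, not necessarily the layout the poster had in mind.

```python
import sqlite3

def open_store(path=":memory:"):
    """Open (or create) a key/value table; PRIMARY KEY provides the index."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value BLOB)"
    )
    return conn

def put(conn, key, value):
    """Insert or overwrite the value stored under key."""
    conn.execute(
        "INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)", (key, value)
    )
    conn.commit()

def get(conn, key):
    """Return the stored bytes for key, or None if the key is absent."""
    row = conn.execute("SELECT value FROM kv WHERE key = ?", (key,)).fetchone()
    return None if row is None else row[0]
```

For a bulk load of millions of rows, committing once per batch rather than once per put would likely matter more than any other detail here.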

Re: key/value store optimized for disk storage

2012-05-03 Thread Kiuhnm
On 5/3/2012 10:42, Steve Howell wrote: On May 2, 11:48 pm, Paul Rubin wrote: Paul Rubin writes: looking at the spec more closely, there are 256 hash tables.. ... You know, there is a much simpler way to do this, if you can afford to use a few hundred MB of memory and you don't mind some loa

Re: key/value store optimized for disk storage

2012-05-03 Thread Steve Howell
On May 2, 11:48 pm, Paul Rubin wrote: > Paul Rubin writes: > >looking at the spec more closely, there are 256 hash tables.. ... > > You know, there is a much simpler way to do this, if you can afford to > use a few hundred MB of memory and you don't mind some load time when > the program first st

Re: key/value store optimized for disk storage

2012-05-02 Thread Paul Rubin
Paul Rubin writes: >looking at the spec more closely, there are 256 hash tables.. ... You know, there is a much simpler way to do this, if you can afford to use a few hundred MB of memory and you don't mind some load time when the program first starts. Just dump all the data sequentially into a
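The message is cut off in this archive view, so the rest of the proposal is not visible. One plausible reading, offered only as an assumption, is a single data file holding all values back to back plus an in-memory dict that maps each key to an offset and length. The sketch below (Python 3) illustrates that reading. build_store, read_value, and the sample keys are hypothetical, and the index is built while writing rather than at program start to keep the example short.

```python
import os
import tempfile

def build_store(path, items):
    """Write all values sequentially; return {key: (offset, length)}."""
    index = {}
    with open(path, "wb") as data_file:
        for key, value in items:
            index[key] = (data_file.tell(), len(value))
            data_file.write(value)
    return index

def read_value(path, index, key):
    """Fetch one value with a single seek and a single read."""
    offset, length = index[key]
    with open(path, "rb") as data_file:
        data_file.seek(offset)
        return data_file.read(length)

# Hypothetical demo data in a scratch directory.
DATA_PATH = os.path.join(tempfile.mkdtemp(), "values.dat")
INDEX = build_store(
    DATA_PATH,
    [("dir1/a.txt", b"alpha"), ("dir1/b.txt", b"bravo!")],
)
```

A long-running reader would keep the file open instead of reopening it on every lookup. The memory cost is one dict entry per key, which is where the "few hundred MB" trade-off in the message would come from.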

Re: key/value store optimized for disk storage

2012-05-02 Thread Paul Rubin
Steve Howell writes: > Doesn't cdb do at least one disk seek as well? In the diagram on this > page, it seems you would need to do a seek based on the value of the > initial pointer (from the 256 possible values): Yes, of course it has to seek if there is too much data to fit in memory. All I'm

Re: key/value store optimized for disk storage

2012-05-02 Thread Steve Howell
On May 2, 8:29 pm, Paul Rubin wrote: > Steve Howell writes: > > Thanks.  That's definitely in the spirit of what I'm looking for, > > although the non-64 bit version is obviously geared toward a slightly > > smaller data set.  My reading of cdb is that it has essentially 64k > > hash buckets, so

Re: key/value store optimized for disk storage

2012-05-02 Thread William R. Wing (Bill Wing)
On May 2, 2012, at 10:14 PM, Steve Howell wrote: > This is slightly off topic, but I'm hoping folks can point me in the > right direction. > > I'm looking for a fairly lightweight key/value store that works for > this type of problem: > > ideally plays nice with the Python ecosystem > the data

Re: key/value store optimized for disk storage

2012-05-02 Thread Paul Rubin
Steve Howell writes: > Thanks. That's definitely in the spirit of what I'm looking for, > although the non-64 bit version is obviously geared toward a slightly > smaller data set. My reading of cdb is that it has essentially 64k > hash buckets, so for 3 million keys, you're still scanning throug

Re: key/value store optimized for disk storage

2012-05-02 Thread Steve Howell
On May 2, 7:46 pm, Paul Rubin wrote: > Steve Howell writes: > >   keys are file paths > >   directories are 2 levels deep (30 dirs w/100k files each) > >   values are file contents > > The current solution isn't horrible, > > Yes it is ;-) > > As I mention up top, I'm mostly hoping folks can poin

Re: key/value store optimized for disk storage

2012-05-02 Thread Tim Chase
On 05/02/12 21:14, Steve Howell wrote: > I'm looking for a fairly lightweight key/value store that works for > this type of problem: > > ideally plays nice with the Python ecosystem > the data set is static, and written infrequently enough that I > definitely want *read* performance to trump a

Re: key/value store optimized for disk storage

2012-05-02 Thread Terry Reedy
On 5/2/2012 10:14 PM, Steve Howell wrote: This is slightly off topic, but I'm hoping folks can point me in the right direction. I'm looking for a fairly lightweight key/value store that works for this type of problem: ideally plays nice with the Python ecosystem the data set is static, an

Re: key/value store optimized for disk storage

2012-05-02 Thread Paul Rubin
Steve Howell writes: > keys are file paths > directories are 2 levels deep (30 dirs w/100k files each) > values are file contents > The current solution isn't horrible, Yes it is ;-) > As I mention up top, I'm mostly hoping folks can point me toward > sources they trust, whether it be ot

key/value store optimized for disk storage

2012-05-02 Thread Steve Howell
This is slightly off topic, but I'm hoping folks can point me in the right direction. I'm looking for a fairly lightweight key/value store that works for this type of problem: ideally plays nice with the Python ecosystem the data set is static, and written infrequently enough that I definitel