Jeremy wrote:

Shelve looks like an interesting option, but what might pose an issue
is that I'm reading the data from a disk instead of memory.  I didn't
mention this in my original post, but I was hoping that by using a
database it would be more memory efficient in storing data in RAM so I
wouldn't have to read from (or swap to/from) disk.  Would using the
shelve package make reading/writing data from disk faster since it is
in a binary format?

Read the docs:

"class shelve.BsdDbShelf(dict[, protocol=None[, writeback=False]])¶
A subclass of Shelf which exposes first(), next(), previous(), last() and set_location() which are available in the bsddb module but not in other database modules. The dict object passed to the constructor must support those methods. This is generally accomplished by calling one of bsddb.hashopen(), bsddb.btopen() or bsddb.rnopen(). The optional protocol and writeback parameters have the same interpretation as for the Shelf class."

Apparently using shelve internally gives you option of using bsddb, which is good news: bsddb is B-tree DB, which is highly efficient for finding keys. I would recommend bsddb.btopen(), as it creates B-tree DB (perhaps other implementations, like anydb or hash db are good as well, but I personally didn't test them out).

I can't say for Berkeley DB implementation, but in general B-tree algorithm has O(log2 n) complexity for finding keys, which roughly means that if you need to find particular key in a db of 1 million keys, you'll probably need ~20 disk accesses (or even less if some keys looked at in the process of search happen to be in the same disk sectors). So yes, it's highly efficient.

Having said that, remember that disk is many orders of magnitude slower than RAM, so it's no free lunch.. Nothing will beat memory-based data structure when it comes to speed (well new flash or hybrid disks perhaps could significantly improve in comparison to current mainstream mechanical-platter disks? there are some hyper-fast storage hardware companies out there, although they tend to charge arm and leg for their stuff for now).

Caveat: Berkeley DB is dual-licensed -- if you're using it for commercial work, it might be that you'd need to buy a license for it. Although I have had no experience with this really, if someone here did perhaps they will shed some light on it?

Regards,
mk



--
http://mail.python.org/mailman/listinfo/python-list

Reply via email to