I was importing into SleepyCat using standard Elephant routines. I am
not aware of an 'import mode' for SleepyCat, but I will look into that
when I have a chance. Another consideration when using SleepyCat is that
BTrees with a large working set demand large amounts of memory
relative to a Hash representation. I am unfamiliar with the internals
of Elephant and SleepyCat, but it feels like the basic access method is
restricting performance; the tradeoff seems to be described here:
http://www.sleepycat.com/docs/gsg/C/accessmethods.html#BTreeVSHash
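For concreteness, the import is basically just a loop of persistent-class
instantiations inside transactions, something along these lines (a sketch;
the class and slot names are placeholders rather than my actual code):

(asdf:operate 'asdf:load-op :elephant)
(defpackage :import-sketch (:use :cl :elephant))
(in-package :import-sketch)

;; Placeholder persistent class; each make-instance writes its slot
;; values through Elephant into the SleepyCat store.
(defclass record ()
  ((key   :initarg :key   :accessor record-key)
   (value :initarg :value :accessor record-value))
  (:metaclass elephant:persistent-metaclass))

(open-store '(:BDB "/path/to/db/"))

(with-transaction ()
  (dotimes (i 100000)
    (make-instance 'record :key i :value (format nil "row-~D" i))))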
My problem so far has been importing the data, which goes very fast
until SleepyCat requires extensive disk access. The in-memory rate is
reasonable and would let the import complete in a few hours. However, once
disk operations begin, the import speed suggests it would take many days to
complete. I have yet to perform extensive benchmarks, but I estimate
the instantiation rate drops from about 1800 persistent-class
instantiations per second to about 120 per second.
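A quick way to measure this is just to time a fixed number of
instantiations, roughly as in the sketch below (reusing the placeholder
record class from above):

;; Rough throughput check: persistent instantiations per second (sketch).
(defun import-rate (n)
  (let ((start (get-internal-real-time)))
    (with-transaction ()
      (dotimes (i n)
        (make-instance 'record :key i :value "x")))
    (float (/ n (/ (- (get-internal-real-time) start)
                   internal-time-units-per-second)))))
;; Reports roughly 1800/s while the working set fits in the cache and
;; closer to 120/s once SleepyCat starts going to disk.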
Here are the two approaches that I hypothesize may help performance. I
am admittedly unaware of the innards of the two systems in question, so you
expert developers will know best. If either sounds appropriate, or if you
envision another possibility for allowing this kind of scaling, I will
look into implementing such a system.
1. Decreasing the size of the working set is one possibility for
reducing run-time memory requirements and disk access. I'm not sure
how the concept of a 'working set' translates from the SleepyCat world
to the Elephant world, but perhaps you do.
2. Using a Hash instead of a BTree for the primary database. I am
unsure what this would mean for Elephant.
In the meantime I will depart from the every-class-is-persistent
approach and also use more traditional data structures.
Thanks again,
Red Daly
Robert L. Read wrote:
Yes, it's amusing.
In my own work I use the Postgres backend; I know very little about
SleepyCat. It seems to me that this is more of a SleepyCat issue than an
Elephant issue. Perhaps you should ask the SleepyCat list?
Are you importing things into SleepyCat directly, in the correct
serialization format so that they can be read by Elephant? If so, I assume
it is just a question of solving the SleepyCat problems.
An alternative would be to use the SQL-based backend. However, I
doubt this will solve your problem, since at present we (well, I wrote it)
use a very inefficient serialization scheme for the SQL-based backend that
base64-encodes everything. This has the advantage of working trouble-free
with different database backends, but it could clearly be improved upon.
However, it is more than efficient enough for all my work, and at
present nobody is clamoring to have it improved.
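Conceptually it is just "serialize, then base64-encode" on the way into
the SQL store, something like the sketch below; serialize-to-octets and
deserialize-from-octets here are stand-ins for the internal serializer,
not actual Elephant functions:

;; Conceptual sketch only, not the real Elephant SQL-backend code.
;; SERIALIZE-TO-OCTETS / DESERIALIZE-FROM-OCTETS stand in for the
;; internal serializer that produces/consumes an octet vector.
(defun encode-for-sql (value)
  (cl-base64:usb8-array-to-base64-string (serialize-to-octets value)))

(defun decode-from-sql (base64-string)
  (deserialize-from-octets
   (cl-base64:base64-string-to-usb8-array base64-string)))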
Is your problem importing the data, or using it once it is imported?
It's hard for me to imagine a problem so large that even the import time
is a problem --- suppose it takes 24 hours --- can you not afford to pay
that?
A drastic and potentially expensive measure would be to switch to a
64-bit architecture with a huge amount of memory. I intend to do that when
forced to by performance issues in my own work.
On Tue, 2006-10-10 at 00:46 -0700, Red Daly wrote:
I will be running experiments in informatics and modeling in the future
that may involve (tens or hundreds of) millions of objects. Given the
ease of use of Elephant so far, it would be great to use it as the
persistent store and avoid creating too many custom data structures.
I have recently run up against some performance bottlenecks when using
Elephant to work with very large datasets (in the hundreds of millions
of objects). Using SleepyCat, I am able to import data very quickly
with a DB_CONFIG file with the following contents:
set_lk_max_locks 500000
set_lk_max_objects 500000
set_lk_max_lockers 500000
set_cachesize 1 0 0
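(For reference, set_cachesize takes gigabytes, bytes, and a number of
cache regions, so the line above requests a 1 GB cache; a 512 MB cache
would instead be:)
set_cachesize 0 536870912 1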
I can import data very quickly until the 1 GB cache is too small to
allow complete in-memory access to the database. At this point it seems
that disk I/O makes additional writes happen much more slowly. (I have also
tried increasing the 1 GB cache size, but the database fails to open if
it is too large--e.g. 2 GB. I have 1.25 GB of physical memory and 4 GB of
swap, so the constraint seems to be physical memory.) The set_lk_max_*
lines allow transactions to contain hundreds of thousands of
individual locks, reducing the transaction-throughput bottleneck.
What are the technical restrictions on writing several million objects
to the data store? Is it feasible to create a batch-import feature that
allows large datasets to be imported using reasonable amounts of memory
on a desktop computer?
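For example, something along the lines of the sketch below might be
enough: commit in fixed-size batches so that no single transaction needs
an enormous lock table or cache. (This is only a sketch; record stands
for whatever persistent class is being imported, and with-transaction is
Elephant's transaction macro.)

;; Sketch of a batched import: commit every BATCH-SIZE objects so that
;; no single transaction holds hundreds of thousands of locks at once.
(defun batch-import (rows &key (batch-size 10000))
  (loop while rows
        do (with-transaction ()
             (loop repeat batch-size
                   while rows
                   do (let ((row (pop rows)))
                        (make-instance 'record
                                       :key (first row)
                                       :value (second row)))))))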
I hope this email is at least amusing!
Thanks again,
red daly
_______________________________________________
elephant-devel site list
elephant-devel@common-lisp.net
http://common-lisp.net/mailman/listinfo/elephant-devel