Some experiments performed on an old topic...
On Oct 11, 2006, at 8:57 PM, Red Daly wrote:
I was importing into sleepycat using standard elephant routines. I
am not aware of an 'import mode' for sleepycat, but I will look
into that when I have a chance. Another consideration with Sleepycat is
that BTrees with a large working set demand large amounts of memory
relative to a Hash representation. I am unfamiliar with the internals of
Elephant and Sleepycat, but it feels like the basic access method is
restricting performance, as seems to be described here:
http://www.sleepycat.com/docs/gsg/C/accessmethods.html#BTreeVSHash
My problem so far has been importing the data, which goes very fast
until sleepycat requires extensive disk access. The in-memory rate
is reasonable and would complete in a few hours. However, once
disk operations begin the import speed suggests it would take many
days to complete. I have yet to perform extensive benchmarks, but
I estimate the instantiation rate drops from roughly 1800 persistent-class
instantiations per second to about 120 per second.
The biggest performance factor is properly managing transaction sizes
to balance contention, total locks used, etc.
You can also turn off all transactional and log synchronization, since
if you crash you can always restart a several-hour import. I think this
avoids some additional overhead, though I have not benchmarked it. E.g.
(here 'my-object stands in for whatever persistent class you are creating):

(with-transaction (:txn-nosync t :dirty-read t)
  (dotimes (i 500)              ; create a batch of 500 objects
    (make-instance 'my-object)))
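The transaction-sizing point above can be sketched a little more fully. This is a hypothetical helper, not part of Elephant's API: it assumes a persistent class `datum` with a `:value` initarg, and simply commits every `*batch-size*` creations so that no single transaction accumulates hundreds of thousands of locks. The batch size of 500 is an arbitrary starting point, not a recommendation:

```lisp
;; Sketch of a chunked bulk import: commit in fixed-size batches to
;; bound per-transaction lock usage. DATUM is a stand-in class.
(defparameter *batch-size* 500)

(defun import-records (values)
  "Create one persistent DATUM per element of VALUES, *BATCH-SIZE* per txn."
  (loop while values
        do (with-transaction (:txn-nosync t)
             (loop repeat *batch-size*
                   while values
                   do (make-instance 'datum :value (pop values))))))
```

Larger batches mean fewer commits but more locks held at once, so the sweet spot depends on the set_lk_max_* limits in your DB_CONFIG.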
Here are the two approaches that I hypothesize may help
performance. I am admittedly unaware of the innards of the two systems
in question, so you expert developers will know best. If either
sounds appropriate, or you envision another possibility for allowing
this kind of scaling, I will look into implementing such a system.
1. Decreasing the size of the working set is one possibility for
reducing run-time memory requirements and disk access. I'm not
sure how the concept of a 'working set' translates from the
Sleepycat world to the Elephant world, but perhaps you do.
What do you mean by working set? When loading data into a database
you are moving index pages around in the btree and allocating endless
numbers of leaf nodes. The index nodes are cacheable, but the leaf
nodes definitely are not! I think there are ways to add a bunch of
objects and then force the btree to update all the index pages at once,
but access to that functionality, if BDB even supports it, is not
provided in Elephant.
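One mitigation for the leaf-node churn described above, if the import order is under your control, is to insert in key-sorted order: each leaf page then fills once and is not revisited, so the cache mainly needs to hold the right edge of the tree. A minimal sketch using Elephant's `add-to-root`, assuming string keys; for a large import the single transaction here would itself need chunking as discussed earlier:

```lisp
;; Sketch: sort (key . value) pairs before inserting, so btree page
;; writes are append-like rather than random.
(defun import-sorted (pairs)
  "PAIRS is a list of (string-key . value) conses."
  (let ((sorted (sort (copy-list pairs) #'string< :key #'car)))
    (with-transaction (:txn-nosync t)
      (dolist (pair sorted)
        (add-to-root (car pair) (cdr pair))))))
```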
2. Using a Hash instead of a BTree in the primary database? I am
unsure what this means for Elephant.
I finally got around to trying this, and it showed poorer performance
on a large stress test (create, modify, and access 10k objects). I
don't have a good theory as to why it was slower, other than that
during creation the hash table had to grow.
In the mean time I will depart from the every-class-is-persistent
approach and also use more traditional data structures.
Thanks again,
Red Daly
Robert L. Read wrote:
Yes, it's amusing.
In my own work I use the Postgres backend; I know very little about
SleepyCat. It seems to me that this is more of a SleepyCat issue than
an Elephant issue. Perhaps you should ask the SleepyCat list?
Are you importing things into SleepyCat directly, in the correct
serialization format so that they can be read by Elephant? If so, I
assume it is just a question of solving the SleepyCat problems.
An alternative would be to use the SQL-based backend. However, I
doubt this will solve your problem, since at present we (well, I wrote
it) use a very inefficient serialization scheme for the SQL-based
backend that base64-encodes everything. This has the advantage of
working trouble-free with different database backends, but it could
clearly be improved upon. However, it is more than efficient enough
for all my work, and at present nobody is clamoring to have it
improved.
Is your problem importing the data, or using it once it is imported?
It's hard for me to imagine a problem so large that even the import
time is a problem --- suppose it takes 24 hours --- can you not afford
to pay that?
A drastic and potentially expensive measure would be to switch to a
64-bit architecture with a huge memory. I intend to do that when
forced by performance issues in my own work.
On Tue, 2006-10-10 at 00:46 -0700, Red Daly wrote:
I will be running experiments in informatics and modeling in the
future that may contain (tens or hundreds of) millions of
objects. Given the ease of use of elephant so far, it would be
great to use it as the persistent store and avoid creating too
many custom data structures.
I have recently run up against some performance bottlenecks when
using elephant to work with very large datasets (in the hundreds
of millions of objects). Using SleepyCat, I am able to import
data very quickly with a DB_CONFIG file with the following contents:
set_lk_max_locks 500000
set_lk_max_objects 500000
set_lk_max_lockers 500000
set_cachesize 1 0 0
I can import data very quickly until the 1 GB cache is too small
to allow complete in-memory access to the database. At this
point it seems that disk I/O makes additional writes happen much
more slowly. (I have also tried increasing the 1 GB cache size, but
the database fails to open if it is too large, e.g. 2 GB. I
have 1.25 GB physical memory and 4 GB swap, so the constraint
seems to be physical memory.) The set_lk_max_* lines allow
transactions to contain hundreds of thousands of individual
locks, limiting the transaction-throughput bottleneck.
What are the technical restrictions on writing several million
objects to the datastore? Is it feasible to create a batch
import feature to allow large datasets to be imported using
reasonable amounts of memory for a desktop computer?
I hope this email is at least amusing!
Thanks again,
red daly
_______________________________________________
elephant-devel site list
elephant-devel@common-lisp.net
http://common-lisp.net/mailman/listinfo/elephant-devel