Can you tell me a little about what the import operations look like - that is, how many objects are created per transaction, how many slots per created object, etc.? Things are only cached in memory during a transaction. To ensure ACID properties (unless you've turned off synchronization), every transaction should flush to disk just prior to completion. It sounds almost as if you're doing one giant transaction, or perhaps I have the scale wrong and it's the BTree index cache that is eating up all your working memory.
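If the whole import is going into one transaction, one thing to try is
committing in fixed-size batches, so that each commit flushes a bounded
amount of work and the lock table stays small. A rough sketch of what I
mean - the RECORD class and IMPORT-RECORDS function are made up for
illustration, and it assumes a store has already been opened with
open-store:

  (defclass record ()
    ((id    :accessor record-id    :initarg :id)
     (value :accessor record-value :initarg :value))
    (:metaclass elephant:persistent-metaclass))

  (defun import-records (records &key (batch-size 1000))
    ;; RECORDS is assumed to be a list of (id value) pairs.  Each pass of
    ;; the outer loop commits one transaction of at most BATCH-SIZE new
    ;; persistent objects, so the per-transaction working set is bounded.
    (loop while records
          do (elephant:with-transaction ()
               (loop repeat batch-size
                     while records
                     do (let ((r (pop records)))
                          (make-instance 'record
                                         :id (first r)
                                         :value (second r)))))))

The right batch size is something you'd have to measure; too small and
you pay for commit overhead, too large and you're back in the
giant-transaction case.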
The working set is the number of distinct 'pages' touched during a transaction (or set of transactions). In Elephant, each unique slot access will generally hit a different page, but slot accesses that are nearby in the BTree index may share storage. Import, however, is by its very nature a linear operation: there is (roughly) no locality, since every record is new, so you'll be allocating lots of new pages and rebalancing the BTrees quite a bit.
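To put rough numbers on that - and these figures are pure assumptions
for illustration, not measurements of Elephant or Berkeley DB - if each
new slot write lands on an effectively random BTree page, the index
pages touched by a big import quickly dwarf a 1 GB cache:

  ;; Back-of-envelope sketch; all figures are assumed, the point is only
  ;; the order of magnitude.
  (let* ((objects       10000000)  ; imported objects
         (slots-per-obj 10)        ; persistent slots per object
         (entries/page  100)       ; index entries per 4 KB page
         (page-bytes    4096)
         (pages (ceiling (* objects slots-per-obj) entries/page)))
    (format t "~:d pages, about ~,1f GB of index touched~%"
            pages (/ (* pages page-bytes) (expt 2.0 30))))

With those numbers that's about a million pages and nearly 4 GB of
index - several times the 1 GB cache in your DB_CONFIG - which would be
consistent with the slowdown you see once the import outgrows memory.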
Until I have a better sense of how you are using transactions, it's hard to be more helpful. My own DB is about 6 GB, but I've built it up over a long time with a lot of large records.

Thanks,
Ian

Red Daly wrote:
> I was importing into sleepycat using standard elephant routines. I am not
> aware of an 'import mode' for sleepycat, but I will look into that when I
> have a chance. Another consideration using sleepycat is that using BTrees
> with a large working set demands large amounts of memory relative to a
> Hash representation. I am unfamiliar with the internals of elephant and
> sleepycat, but it feels like the basic access method is restricting
> performance, which seems to be described here:
> http://www.sleepycat.com/docs/gsg/C/accessmethods.html#BTreeVSHash
>
> My problem so far has been importing the data, which goes very fast until
> sleepycat requires extensive disk access. The in-memory rate is reasonable
> and would complete in a few hours. However, once disk operations begin,
> the import speed suggests it would take many days to complete. I have yet
> to perform extensive benchmarks, but I estimate the instantiation rate
> shifts from 1800 persistent class instantiations/second to 120/s.
>
> Here are the two approaches that I hypothesize may help performance. I am
> admittedly unaware of the innards of the two systems in question, so you
> expert developers will know best. If either sounds appropriate, or you
> envision another possibility for allowing this kind of scaling, I will
> look into implementing such a system.
>
> 1. Decreasing the size of the working set is one possibility for
> decreasing run-time memory requirements and disk access. I'm not sure how
> the concept of a 'working set' translates from the sleepycat world to the
> elephant world, but perhaps you do.
>
> 2. Using a Hash instead of a BTree in the primary database? I am unsure
> what this means for elephant.
>
> In the meantime I will depart from the every-class-is-persistent approach
> and also use more traditional data structures.
>
> Thanks again,
> Red Daly
>
> Robert L. Read wrote:
>> Yes, it's amusing.
>>
>> In my own work I use the Postgres backend; I know very little about
>> SleepyCat. It seems to me that this is more of a SleepyCat issue than an
>> Elephant issue. Perhaps you should ask the SleepyCat list?
>>
>> Are you importing things into SleepyCat directly in the correct
>> serialization format, so that they can be read by Elephant? If so, I
>> assume it is just a question of solving the SleepyCat problems.
>>
>> An alternative would be to use the SQL-based backend. However, I doubt
>> this will solve your problem, since at present we (well, I wrote it) use
>> a very inefficient serialization scheme for the SQL-based backend that
>> base64 encodes everything. This had the advantage that it makes it work
>> trouble-free with different database backends, but could clearly be
>> improved upon. However, it is more than efficient enough for all my
>> work, and at present nobody is clamoring to have it improved.
>>
>> Is your problem importing the data or using it once it is imported?
>> It's hard for me to imagine a problem so large that even the import time
>> is a problem --- suppose it takes 24 hours --- can you not afford to pay
>> that?
>>
>> A drastic and potentially expensive measure would be to switch to a
>> 64-bit architecture with a huge memory. I intend to do that when forced
>> by performance issues in my own work.
>>
>> On Tue, 2006-10-10 at 00:46 -0700, Red Daly wrote:
>>> I will be running experiments in informatics and modeling in the future
>>> that may contain (tens or hundreds of) millions of objects. Given the
>>> ease of use of elephant so far, it would be great to use it as the
>>> persistent store and avoid creating too many custom data structures.
>>>
>>> I have recently run up against some performance bottlenecks when using
>>> elephant to work with very large datasets (in the hundreds of millions
>>> of objects). Using SleepyCat, I am able to import data very quickly
>>> with a DB_CONFIG file with the following contents:
>>>
>>> set_lk_max_locks 500000
>>> set_lk_max_objects 500000
>>> set_lk_max_lockers 500000
>>> set_cachesize 1 0 0
>>>
>>> I can import data very quickly until the 1 GB cache is too small to
>>> allow complete in-memory access to the database. At this point it seems
>>> that disk IO makes additional writes happen much slower. (I have also
>>> tried increasing the 1 GB cache size, but the database fails to open if
>>> it is too large--e.g. 2 GB. I have 1.25 GB physical memory and 4 GB
>>> swap, so the constraint seems to be physical memory.) The max_lock,
>>> etc. lines allow transactions to contain hundreds of thousands of
>>> individual locks, limiting the transaction throughput bottleneck.
>>>
>>> What are the technical restrictions on writing several million objects
>>> to the datastore? Is it feasible to create a batch import feature to
>>> allow large datasets to be imported using reasonable amounts of memory
>>> for a desktop computer?
>>>
>>> I hope this email is at least amusing!
>>>
>>> Thanks again,
>>> red daly

_______________________________________________
elephant-devel site list
elephant-devel@common-lisp.net
http://common-lisp.net/mailman/listinfo/elephant-devel