Some experiments performed on an old topic...
On Oct 11, 2006, at 8:57 PM, Red Daly wrote:
I was importing into sleepycat using standard elephant routines. I
am not aware of an 'import mode' for sleepycat, but I will look
into that when I have a chance. Another consideration with Sleepycat is
that BTrees with a large working set demand large amounts of memory
relative to a Hash representation. I am unfamiliar with the internals of
Elephant and Sleepycat, but it feels like the basic access method is
restricting performance, as seems to be described here:
http://www.sleepycat.com/docs/gsg/C/accessmethods.html#BTreeVSHash
My problem so far has been importing the data, which goes very fast
until sleepycat requires extensive disk access. The in-memory rate
is reasonable and would complete in a few hours. However, once
disk operations begin the import speed suggests it would take many
days to complete. I have yet to perform extensive benchmarks, but
I estimate the instantiation rate drops from roughly 1800 persistent-class
instantiations per second to about 120 per second.
The biggest performance factor is properly managing transaction sizes
to balance contention, total locks used, etc.
You can also turn off all transactional and log synchronization, since
if you crash you can always restart a several-hour import. I think this
avoids some additional overhead, though I have not benchmarked it. E.g.
(here 'my-object stands in for whatever persistent class you are creating):

(with-transaction (:txn-nosync t :dirty-read t)
  (dotimes (i 500)              ; create a batch of 500 objects
    (make-instance 'my-object)))
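The transaction-sizing point above can be sketched a little more fully. This is a hypothetical helper, not part of Elephant's API: it assumes a persistent class `datum` with a `:value` initarg, and simply commits every `*batch-size*` creations so that no single transaction accumulates hundreds of thousands of locks. The batch size of 500 is an arbitrary starting point, not a recommendation:

```lisp
;; Sketch of a chunked bulk import: commit in fixed-size batches to
;; bound per-transaction lock usage. DATUM is a stand-in class.
(defparameter *batch-size* 500)

(defun import-records (values)
  "Create one persistent DATUM per element of VALUES, *BATCH-SIZE* per txn."
  (loop while values
        do (with-transaction (:txn-nosync t)
             (loop repeat *batch-size*
                   while values
                   do (make-instance 'datum :value (pop values))))))
```

Larger batches mean fewer commits but more locks held at once, so the sweet spot depends on the set_lk_max_* limits in your DB_CONFIG.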
Here are the two approaches that I hypothesize may help
performance. I am admittedly unaware of the innards of the two systems
in question, so you expert developers will know best. If either
sounds appropriate, or you envision another possibility for allowing
this kind of scaling, I will look into implementing such a system.
1. Decreasing the size of the working set is one possibility for
reducing run-time memory requirements and disk access. I'm not
sure how the concept of a 'working set' translates from the
Sleepycat world to the Elephant world, but perhaps you do.
What do you mean by working set? When loading data into a database
you are moving index pages around in the btree and allocating endless
numbers of leaf nodes. The index nodes are cacheable, but the leaf
nodes definitely are not! I think there are ways to add a bunch of
objects and then force the btree to update all the index pages at once,
but access to that functionality, if BDB even supports it, is not
provided in Elephant.
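One mitigation for the leaf-node churn described above, if the import order is under your control, is to insert in key-sorted order: each leaf page then fills once and is not revisited, so the cache mainly needs to hold the right edge of the tree. A minimal sketch using Elephant's `add-to-root`, assuming string keys; for a large import the single transaction here would itself need chunking as discussed earlier:

```lisp
;; Sketch: sort (key . value) pairs before inserting, so btree page
;; writes are append-like rather than random.
(defun import-sorted (pairs)
  "PAIRS is a list of (string-key . value) conses."
  (let ((sorted (sort (copy-list pairs) #'string< :key #'car)))
    (with-transaction (:txn-nosync t)
      (dolist (pair sorted)
        (add-to-root (car pair) (cdr pair))))))
```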
2. Using a Hash instead of a BTree in the primary database? I am
unsure what this means for Elephant.
I finally got around to trying this, and it showed poorer performance
on a large stress test (create, modify, and access 10k objects). I
don't have a good theory as to why it was slower, other than that
during creation the hash table had to grow.
In the mean time I will depart from the every-class-is-persistent
approach and also use more traditional data structures.
Thanks again,
Red Daly
Robert L. Read wrote:
Yes, it's amusing.
In my own work I use the Postgres backend; I know very little about
SleepyCat. It seems to me that this is more of a SleepyCat issue than
an Elephant issue. Perhaps you should ask the SleepyCat list?
Are you importing things into SleepyCat directly, in the correct
serialization format so that they can be read by Elephant? If so, I
assume it is just a question of solving the SleepyCat problems.
An alternative would be to use the SQL-based backend. However, I
doubt this will solve your problem, since at present we (well, I wrote
it) use a very inefficient serialization scheme for the SQL-based
backend that base64-encodes everything. This has the advantage of
working trouble-free with different database backends, but it could
clearly be improved upon. However, it is more than efficient enough
for all my work, and at present nobody is clamoring to have it
improved.
Is your problem importing the data, or using it once it is imported?
It's hard for me to imagine a problem so large that even the import
time is a problem --- suppose it takes 24 hours --- can you not afford
to pay that?
A drastic and potentially expensive measure would be to switch to a
64-bit architecture with a huge memory. I intend to do that when
forced by performance issues in my own work.
On Tue, 2006-10-10 at 00:46 -0700, Red Daly wrote:
I will be running experiments in informatics and modeling in the
future that may contain (tens or hundreds of) millions of
objects. Given the ease of use of elephant so far, it would be
great to use it as the persistent store and avoid creating too
many custom data structures.
I have recently run up against some performance bottlenecks when
using elephant to work with very large datasets (in the hundreds
of millions of objects). Using SleepyCat, I am able to import
data very quickly with a DB_CONFIG file with the following contents:
set_lk_max_locks 500000
set_lk_max_objects 500000
set_lk_max_lockers 500000
set_cachesize 1 0 0
I can import data very quickly until the 1 GB cache is too small
to allow complete in-memory access to the database. At this
point it seems that disk I/O makes additional writes happen much
more slowly. (I have also tried increasing the 1 GB cache size, but
the database fails to open if it is too large, e.g. 2 GB. I
have 1.25 GB physical memory and 4 GB swap, so the constraint
seems to be physical memory.) The set_lk_max_* lines allow
transactions to contain hundreds of thousands of individual
locks, limiting the transaction-throughput bottleneck.
What are the technical restrictions on writing several million
objects to the datastore? Is it feasible to create a batch
import feature to allow large datasets to be imported using
reasonable amounts of memory for a desktop computer?
I hope this email is at least amusing!
Thanks again,
red daly
_______________________________________________
elephant-devel site list
elephant-devel@common-lisp.net
http://common-lisp.net/mailman/listinfo/elephant-devel