Hi,

Don't use JDBM if you want a BTree; it is too slow... been there. Use Xindice's Filer class (org.apache.xindice.core.filer.Filer) and instantiate an org.apache.xindice.core.filer.BTreeFiler, which is a whole lot faster.
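Getting started looks roughly like this. It is a minimal sketch from memory of the Xindice 1.1 API (the setFile() location handling and the Key/Value constructors in particular are assumptions), so verify the details against the Filer javadocs:

    import java.io.File;

    import org.apache.xindice.core.data.Key;
    import org.apache.xindice.core.data.Record;
    import org.apache.xindice.core.data.Value;
    import org.apache.xindice.core.filer.BTreeFiler;

    public class BTreeFilerExample {
        public static void main(String[] args) throws Exception {
            BTreeFiler filer = new BTreeFiler();

            // Assumption: point the filer at its backing file via the
            // inherited Paged.setFile(); check the javadocs for the exact
            // configuration calls your Xindice version expects.
            filer.setFile(new File("/tmp/docs.tbl"));
            if (!filer.exists()) {
                filer.create();
            }
            filer.open();

            // Store a record under a key, then read it back.
            filer.writeRecord(new Key("doc-42"), new Value("payload"));
            Record rec = filer.readRecord(new Key("doc-42"));
            System.out.println(rec.getValue());

            filer.close();
        }
    }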
JDBM and Xindice are comparable in reading speed, but certainly not in writing. Check out my implementation of the BTree:
http://dev.tailsweep.com/svn/abstractcache/trunk/src/main/java/org/tailsweep/abstractcache/disk/xindice/TreeCache.java

(A rough sketch of the JDBM route Jason describes is at the bottom of this mail.)

Kindly
//Marcus

On Wed, Jul 30, 2008 at 10:44 PM, Jason Rutherglen <[EMAIL PROTECTED]> wrote:

> A possible open source solution using a page-based database would be to store the documents in http://jdbm.sourceforge.net/, which offers BTree, Hash, and raw page-based access. One would use a primary-key-style persistent ID to look up the document data from JDBM.
>
> This would be a good Lucene project to implement, and I think a good solution for Ocean LUCENE-1313. Storing documents in Lucene is fine, but for a realtime search index with many documents being deleted, a lot of garbage builds up. Frequent merging of document files becomes IO intensive.
>
> Of course, one issue with JDBM (and I am not sure how other SQL page-based systems handle this) is whether it can load individual fields directly from disk rather than loading the entire page into RAM first. Maybe it does not matter.
>
> On Mon, Jul 28, 2008 at 9:53 PM, John Evans <[EMAIL PROTECTED]> wrote:
>
> > Hi All,
> >
> > I have successfully used Lucene in the "traditional" way to provide full-text search for various websites. Now I am tasked with developing a data store to back a web crawler. The crawler can be configured to retrieve arbitrary fields from arbitrary pages, so the result is that each document may have a random assortment of fields. It seems like Lucene may be a natural fit for this scenario, since you can obviously add arbitrary fields to each document and you can store the actual data in the index. I've done some research to make sure that it would meet all of our individual requirements (that we can iterate over documents, update (delete/replace) documents, etc.) and everything looks good. I've also seen a couple of references around the net to other people trying similar things... however, I know it's not meant to be used this way, so I thought I would post here and ask for guidance. Has anyone done something similar? Is there any specific reason to think this is a bad idea?
> >
> > The one thing that I am least certain about is how well it will scale. We may reach the point where we have tens of millions of documents, and a high percentage of those documents may be relatively large (10k-50k each). We actually would NOT be expecting/needing Lucene's normally extremely fast text search times for this, but we would need reasonable times for adding new documents to the index, retrieving documents by ID (for iterating over all documents), optimizing the index after a series of changes, etc.
> >
> > Any advice/input/theories anyone can contribute would be greatly appreciated.
> >
> > Thanks,
> > -
> > John

--
Marcus Herou
CTO and co-founder
Tailsweep AB
+46702561312
[EMAIL PROTECTED]
http://www.tailsweep.com/
http://blogg.tailsweep.com/
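P.S. For comparison, the JDBM route Jason describes above would look roughly like this. A sketch against the jdbm 1.0 RecordManager/BTree API; the "docstore" name and the String payload are made up, and a long-lived store would persist the tree's record id and reopen it with BTree.load() instead of creating a new instance each run:

    import java.io.Serializable;
    import java.util.Comparator;

    import jdbm.RecordManager;
    import jdbm.RecordManagerFactory;
    import jdbm.btree.BTree;

    public class JdbmDocStore {

        // jdbm requires the BTree comparator to be Serializable.
        static class StringComparator implements Comparator, Serializable {
            public int compare(Object a, Object b) {
                return ((String) a).compareTo((String) b);
            }
        }

        public static void main(String[] args) throws Exception {
            // Creates/opens docstore.db and docstore.lg in the working directory.
            RecordManager recman = RecordManagerFactory.createRecordManager("docstore");
            BTree tree = BTree.createInstance(recman, new StringComparator());

            // Persistent ID -> document payload, as suggested above.
            tree.insert("doc-42", "serialized document bytes go here", true);
            recman.commit();

            String doc = (String) tree.find("doc-42");
            System.out.println(doc);

            recman.close();
        }
    }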