Hi Shi,

Nobody will be able to give you a precise answer, obviously; the best way is to try. You didn't say what response time is desirable or what kind of hardware you will be using.
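If you do end up partitioning the index, combining the per-partition searches is straightforward; here's a minimal sketch against the Lucene 2.x API of the time (the index paths and field name are hypothetical placeholders, not anything from your setup):

```java
// Sketch only: assumes Lucene 2.x on the classpath; paths/field are made up.
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ParallelMultiSearcher;
import org.apache.lucene.search.Searchable;
import org.apache.lucene.search.Searcher;

public class PartitionedSearch {
    public static void main(String[] args) throws Exception {
        // One IndexSearcher per index partition.
        Searchable[] shards = new Searchable[] {
            new IndexSearcher("/indexes/part0"),
            new IndexSearcher("/indexes/part1"),
        };
        // ParallelMultiSearcher runs the shard searches concurrently and
        // merges the hits back into a single ranked result list.
        Searcher searcher = new ParallelMultiSearcher(shards);
        Hits hits = searcher.search(
            new QueryParser("body", new StandardAnalyzer()).parse("foo bar"));
        System.out.println(hits.length() + " hits");
        searcher.close();
    }
}
```

Swap ParallelMultiSearcher for plain MultiSearcher if you'd rather search the shards sequentially; the two are drop-in replacements for each other.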
I wouldn't bother with the Berkeley DB-backed Lucene index for now; just use the regular one (maybe with the non-compound format). If you need to partition your index, MultiSearcher will help you search all your indices, and its Parallel cousin will let you parallelize those searches. It sounds like rsync will work, but you'll have to make sure that the segments file gets rsynced last.

Otis
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/ - Tag - Search - Share

----- Original Message ----
From: shai deljo <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Tuesday, February 20, 2007 5:51:13 AM
Subject: Using Lucene - Design Question

Hi,

I have no experience with Lucene and I'm trying to collect some information in order to determine which solution is best for me. I need to index ~50M documents (starting with 10M); each document is ~2k-~5k, and I'll index a couple of fields per document. I expect ~20 queries per second, each query being ~4 terms.

Update rate: I'm not sure what the best and/or feasible strategy is performance-wise, i.e. incremental indexing vs. pushing a full index, but as far as the product is concerned, most data can be updated daily; the head (let's say 20%) needs hourly updates (or at least on the order of hours).

I also need to be able to override the scoring/ranking and inject my own logic, and of course my main concern is response time, especially since I have additional computation on the hits before returning the results. BTW, for that additional ranking/computation I will need to retrieve values that are mapped by a term-field key, i.e. I can't know the key until I have the result and the query in my hands. I figured I would use Oracle Berkeley DB Java Edition in order to keep the lookups in memory as much as possible -> any advice on this as well?

For these requirements, do I need to worry about partitioning the index?
If I do partition it, is there a solution to merge the results back, or do I need to do it on my own (does Solr do it for me, and if it does, can I override the scoring there)? As far as serving multiple users, will a simple rsync of the index between multiple nodes running the same index work (I am not that sensitive to data integrity), or do I need to look at something like Terracotta?

In short, I am looking for the simplest solution.

Thanks in advance,
Shi

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
---------------------------------------------------------------------