You might want to take a look at RemoteSearchable
(http://lucene.apache.org/java/2_9_2/api/contrib-remote/org/apache/lucene/search/RemoteSearchable.html)
-- it will be helpful if you place shards on different servers.
On Tue, Aug 10, 2010 at 6:08 PM, Shelly_Singh <shelly_si...@infosys.com> wrote:

> - shard into 10 indices. If you need the concept of a date range search, I
> would assign the documents to the shard by date; otherwise random assignment
> is fine.
> Shelly - Actually, my documents are originally database records, with each
> being equally important.
>
> - have a pool of IndexSearchers for each index
> - when a search comes in, allocate a Searcher from each index to the search.
> - perform the search in parallel across all indices.
> Shelly - Is it different from MultiSearcher or ParallelMultiSearcher?
>
> - merge the results in your own code using an efficient merging algorithm.
>
> -----Original Message-----
> From: Dan OConnor [mailto:docon...@acquiremedia.com]
> Sent: Tuesday, August 10, 2010 6:02 PM
> To: java-user@lucene.apache.org
> Subject: RE: Scaling Lucene to 1bln docs
>
> Shelly:
>
> You wouldn't necessarily have to use a multisearcher. A suggested
> alternative is:
>
> - shard into 10 indices. If you need the concept of a date range search, I
> would assign the documents to the shard by date; otherwise random assignment
> is fine.
> - have a pool of IndexSearchers for each index
> - when a search comes in, allocate a Searcher from each index to the search.
> - perform the search in parallel across all indices.
> - merge the results in your own code using an efficient merging algorithm.
>
> Regards,
> Dan
>
> -----Original Message-----
> From: Shelly_Singh [mailto:shelly_si...@infosys.com]
> Sent: Tuesday, August 10, 2010 8:20 AM
> To: java-user@lucene.apache.org
> Subject: RE: Scaling Lucene to 1bln docs
>
> No sort. I will need relevance based on TF. If I shard, I will have to
> search in all indices.
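For the last step of Dan's recipe ("merge the results in your own code"), a k-way merge over a priority queue is the usual efficient choice: each shard already returns its hits in descending score order, so the global top-N never needs a full re-sort. A minimal sketch, with a hypothetical `Hit` class standing in for Lucene's `ScoreDoc` (doc id + score) so it runs without Lucene on the classpath:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Each shard's searcher returns its own top-N hits sorted by descending
// score; a k-way merge over a priority queue yields the global top-N.
public class ShardMerge {

    public static final class Hit {
        public final int shard;   // which index the hit came from
        public final int doc;     // per-shard document id
        public final float score; // relevance score
        public Hit(int shard, int doc, float score) {
            this.shard = shard;
            this.doc = doc;
            this.score = score;
        }
    }

    // shardResults: one descending-sorted hit list per shard.
    public static List<Hit> mergeTopN(List<List<Hit>> shardResults, int n) {
        // Queue entries are {shardIndex, positionWithinShard}; the entry
        // whose current hit has the highest score sits at the head.
        PriorityQueue<int[]> pq = new PriorityQueue<>(
            Comparator.comparingDouble(
                (int[] e) -> -shardResults.get(e[0]).get(e[1]).score));
        for (int s = 0; s < shardResults.size(); s++) {
            if (!shardResults.get(s).isEmpty()) {
                pq.add(new int[] {s, 0});
            }
        }
        List<Hit> merged = new ArrayList<>();
        while (!pq.isEmpty() && merged.size() < n) {
            int[] head = pq.poll();
            merged.add(shardResults.get(head[0]).get(head[1]));
            // Advance the cursor within the shard we just consumed from.
            if (head[1] + 1 < shardResults.get(head[0]).size()) {
                pq.add(new int[] {head[0], head[1] + 1});
            }
        }
        return merged;
    }
}
```

With 10 shards each contributing its top 100, the merge touches at most 10 queue entries at a time, so the cost is O(N log k) for N results over k shards. (ParallelMultiSearcher does essentially this for you; the manual version only matters if you want your own searcher pooling as Dan describes.)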
> -----Original Message-----
> From: anshum.gu...@naukri.com [mailto:ansh...@gmail.com]
> Sent: Tuesday, August 10, 2010 1:54 PM
> To: java-user@lucene.apache.org
> Subject: Re: Scaling Lucene to 1bln docs
>
> Would like to know, are you using a particular type of sort? Do you need to
> sort on relevance? Can you shard and restrict your search to a limited set
> of indexes functionally?
>
> --
> Anshum
> http://blog.anshumgupta.net
>
> Sent from BlackBerry(r)
>
> -----Original Message-----
> From: Shelly_Singh <shelly_si...@infosys.com>
> Date: Tue, 10 Aug 2010 13:31:38
> To: java-user@lucene.apache.org
> Reply-To: java-user@lucene.apache.org
> Subject: RE: Scaling Lucene to 1bln docs
>
> Hi Anshum,
>
> I am already running with the 'setCompoundFile' option off.
> And thanks for pointing out mergeFactor. I had tried a higher mergeFactor a
> couple of days ago, but got an OOM, so I discarded it. Later I figured that
> the OOM was because maxMergeDocs was unlimited and I was using MMap. You are
> right, I should try a higher mergeFactor.
>
> With regard to the multithreaded approach, I was considering creating 10
> different threads, each indexing 100 mln docs, coupled with a MultiSearcher
> to which I will feed these 10 indices. Do you think this will improve
> performance?
>
> And just FYI, I have the latest reading for 1 bln docs. Indexing time is 2
> hrs and search time is 15 secs. I can live with the indexing time, but the
> search time is highly unacceptable.
>
> Help again.
>
> -----Original Message-----
> From: Anshum [mailto:ansh...@gmail.com]
> Sent: Tuesday, August 10, 2010 12:55 PM
> To: java-user@lucene.apache.org
> Subject: Re: Scaling Lucene to 1bln docs
>
> Hi Shelly,
> That seems like a reasonable data set size. I'd suggest you increase your
> mergeFactor, as a mergeFactor of 10 says you are only buffering 10 docs in
> memory before writing them to a file (and incurring I/O). You could
> actually flush by RAM usage instead of a doc count.
> Turn off using the compound file structure for indexing, as it generally
> takes more time creating a cfs index.
>
> Plus, the time would not grow linearly: the larger the segments get, the
> more time it takes to add more docs and merge them together intermittently.
> You may also use a multithreaded approach in case reading the source takes
> time in your case, though the IndexWriter would have to be shared among all
> threads.
>
> --
> Anshum Gupta
> http://ai-cafe.blogspot.com
>
>
> On Tue, Aug 10, 2010 at 12:24 PM, Shelly_Singh <shelly_si...@infosys.com> wrote:
>
> > Hi,
> >
> > I am developing an application which uses Lucene for indexing and
> > searching 1 bln documents. (The document size is very small, though. Each
> > document has a single field of 5-10 words, so I believe that my data size
> > is within the tested limits.)
> >
> > I am using the following configuration:
> > 1. 1.5 gig RAM to the JVM
> > 2. 100 GB disk space
> > 3. Index creation tuning factors:
> >    a. mergeFactor = 10
> >    b. maxFieldLength = 10
> >    c. maxMergeDocs = 5000000 (if I try with a larger value, I get an
> >       out-of-memory)
> >
> > With these settings, I am able to create an index of 100 million docs (10
> > pow 8) in 15 mins, consuming a disk space of 2.5 GB, which is quite
> > satisfactory for me. But nevertheless, I want to know what else can be
> > done to tune it further. Please help.
> > Also, with these settings, can I expect the time and size to grow
> > linearly for 1 bln (10 pow 9) documents?
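Pulling the thread's indexing advice together, a configuration sketch against the Lucene 2.9 API the thread is using (the index directory and analyzer choice are placeholders; the mergeFactor and RAM-buffer values are illustrative, not tuned numbers):

```java
import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class TunedWriter {
    public static IndexWriter open(File indexDir) throws Exception {
        IndexWriter writer = new IndexWriter(
                FSDirectory.open(indexDir),
                new StandardAnalyzer(Version.LUCENE_29),
                IndexWriter.MaxFieldLength.LIMITED);
        writer.setUseCompoundFile(false); // skip the extra .cfs packing pass
        writer.setMergeFactor(30);        // merge less often; more segments on disk
        // Flush by RAM usage instead of a fixed doc count, as suggested above:
        writer.setRAMBufferSizeMB(256.0);
        return writer;
    }
}
```

With the 10-shard plan, each of the 10 indexing threads would open its own writer like this against its own directory; a single shared index instead requires one writer shared by all threads, since only one IndexWriter may hold an index's write lock at a time.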
> >
> > Thanks and Regards,
> >
> > Shelly Singh
> > Center For Knowledge Driven Information Systems, Infosys
> > Email: shelly_si...@infosys.com
> > Phone: (M) 91 992 369 7200, (VoIP) 2022978622
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
--
Thanks and Regards,
Prashant Ullegaddi,
Search and Information Extraction Lab,
IIIT-Hyderabad, India.