You might want to take a look at RemoteSearchable
(http://lucene.apache.org/java/2_9_2/api/contrib-remote/org/apache/lucene/search/RemoteSearchable.html)
-- it will be helpful if you place shards on different servers.
On Tue, Aug 10, 2010 at 6:08 PM, Shelly_Singh <shelly_si...@infosys.com> wrote:

> - shard into 10 indices. If you need the concept of a date range search, I
> would assign the documents to the shard by date; otherwise random assignment
> is fine.
> Shelly - Actually, my documents are originally database records, with each
> being equally important.
>
> - have a pool of IndexSearchers for each index
> - when a search comes in, allocate a Searcher from each index to the search.
> - perform the search in parallel across all indices.
> Shelly - Is it different from MultiSearcher or ParallelMultiSearcher?
>
> - merge the results in your own code using an efficient merging algorithm.
>
> -----Original Message-----
> From: Dan OConnor [mailto:docon...@acquiremedia.com]
> Sent: Tuesday, August 10, 2010 6:02 PM
> To: java-user@lucene.apache.org
> Subject: RE: Scaling Lucene to 1bln docs
>
> Shelly:
>
> You wouldn't necessarily have to use a multisearcher. A suggested
> alternative is:
>
> - shard into 10 indices. If you need the concept of a date range search, I
> would assign the documents to the shard by date; otherwise random assignment
> is fine.
> - have a pool of IndexSearchers for each index
> - when a search comes in, allocate a Searcher from each index to the search.
> - perform the search in parallel across all indices.
> - merge the results in your own code using an efficient merging algorithm.
>
> Regards,
> Dan
>
> -----Original Message-----
> From: Shelly_Singh [mailto:shelly_si...@infosys.com]
> Sent: Tuesday, August 10, 2010 8:20 AM
> To: java-user@lucene.apache.org
> Subject: RE: Scaling Lucene to 1bln docs
>
> No sort. I will need relevance based on TF. If I shard, I will have to
> search in all indices.
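For the last step of Dan's recipe ("merge the results in your own code"), a k-way merge over a priority queue is the usual efficient choice: each shard already returns its hits in descending score order, so the global top-N never needs a full re-sort. A minimal sketch, with a hypothetical `Hit` class standing in for Lucene's `ScoreDoc` (doc id + score) so it runs without Lucene on the classpath:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Each shard's searcher returns its own top-N hits sorted by descending
// score; a k-way merge over a priority queue yields the global top-N.
public class ShardMerge {

    public static final class Hit {
        public final int shard;   // which index the hit came from
        public final int doc;     // per-shard document id
        public final float score; // relevance score
        public Hit(int shard, int doc, float score) {
            this.shard = shard;
            this.doc = doc;
            this.score = score;
        }
    }

    // shardResults: one descending-sorted hit list per shard.
    public static List<Hit> mergeTopN(List<List<Hit>> shardResults, int n) {
        // Queue entries are {shardIndex, positionWithinShard}; the entry
        // whose current hit has the highest score sits at the head.
        PriorityQueue<int[]> pq = new PriorityQueue<>(
            Comparator.comparingDouble(
                (int[] e) -> -shardResults.get(e[0]).get(e[1]).score));
        for (int s = 0; s < shardResults.size(); s++) {
            if (!shardResults.get(s).isEmpty()) {
                pq.add(new int[] {s, 0});
            }
        }
        List<Hit> merged = new ArrayList<>();
        while (!pq.isEmpty() && merged.size() < n) {
            int[] head = pq.poll();
            merged.add(shardResults.get(head[0]).get(head[1]));
            // Advance the cursor within the shard we just consumed from.
            if (head[1] + 1 < shardResults.get(head[0]).size()) {
                pq.add(new int[] {head[0], head[1] + 1});
            }
        }
        return merged;
    }
}
```

With 10 shards each contributing its top 100, the merge touches at most 10 queue entries at a time, so the cost is O(N log k) for N results over k shards. (ParallelMultiSearcher does essentially this for you; the manual version only matters if you want your own searcher pooling as Dan describes.)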
> -----Original Message-----
> From: anshum.gu...@naukri.com [mailto:ansh...@gmail.com]
> Sent: Tuesday, August 10, 2010 1:54 PM
> To: java-user@lucene.apache.org
> Subject: Re: Scaling Lucene to 1bln docs
>
> Would like to know, are you using a particular type of sort? Do you need to
> sort on relevance? Can you shard and restrict your search to a limited set
> of indexes functionally?
>
> --
> Anshum
> http://blog.anshumgupta.net
>
> Sent from BlackBerry(r)
>
> -----Original Message-----
> From: Shelly_Singh <shelly_si...@infosys.com>
> Date: Tue, 10 Aug 2010 13:31:38
> To: java-user@lucene.apache.org
> Reply-To: java-user@lucene.apache.org
> Subject: RE: Scaling Lucene to 1bln docs
>
> Hi Anshum,
>
> I am already running with the 'setCompoundFile' option off.
> And thanks for pointing out mergeFactor. I had tried a higher mergeFactor a
> couple of days ago, but got an OOM, so I discarded it. Later I figured that
> the OOM was because maxMergeDocs was unlimited and I was using MMap. You are
> right, I should try a higher mergeFactor.
>
> With regard to the multithreaded approach, I was considering creating 10
> different threads, each indexing 100 mln docs, coupled with a MultiSearcher
> to which I will feed these 10 indices. Do you think this will improve
> performance?
>
> And just FYI, I have the latest reading for 1 bln docs. Indexing time is 2
> hrs and search time is 15 secs. I can live with the indexing time, but the
> search time is highly unacceptable.
>
> Help again.
>
> -----Original Message-----
> From: Anshum [mailto:ansh...@gmail.com]
> Sent: Tuesday, August 10, 2010 12:55 PM
> To: java-user@lucene.apache.org
> Subject: Re: Scaling Lucene to 1bln docs
>
> Hi Shelly,
> That seems like a reasonable data set size. I'd suggest you increase your
> mergeFactor, as a mergeFactor of 10 says you are only buffering 10 docs in
> memory before writing them to a file (and incurring I/O). You could
> actually flush by RAM usage instead of a doc count.
> Turn off using the compound file structure for indexing, as it generally
> takes more time creating a cfs index.
>
> Plus, the time would not grow linearly: the larger the segments get, the
> more time it takes to add more docs and merge them together intermittently.
> You may also use a multithreaded approach in case reading the source takes
> time in your case, though the IndexWriter would have to be shared among all
> threads.
>
> --
> Anshum Gupta
> http://ai-cafe.blogspot.com
>
>
> On Tue, Aug 10, 2010 at 12:24 PM, Shelly_Singh <shelly_si...@infosys.com> wrote:
>
> > Hi,
> >
> > I am developing an application which uses Lucene for indexing and
> > searching 1 bln documents. (The document size is very small, though. Each
> > document has a single field of 5-10 words, so I believe that my data size
> > is within the tested limits.)
> >
> > I am using the following configuration:
> > 1. 1.5 gig RAM to the JVM
> > 2. 100 GB disk space
> > 3. Index creation tuning factors:
> >    a. mergeFactor = 10
> >    b. maxFieldLength = 10
> >    c. maxMergeDocs = 5000000 (if I try with a larger value, I get an
> >       out-of-memory)
> >
> > With these settings, I am able to create an index of 100 million docs (10
> > pow 8) in 15 mins, consuming a disk space of 2.5 GB, which is quite
> > satisfactory for me. But nevertheless, I want to know what else can be
> > done to tune it further. Please help.
> > Also, with these settings, can I expect the time and size to grow
> > linearly for 1 bln (10 pow 9) documents?
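Pulling the thread's indexing advice together, a configuration sketch against the Lucene 2.9 API the thread is using (the index directory and analyzer choice are placeholders; the mergeFactor and RAM-buffer values are illustrative, not tuned numbers):

```java
import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class TunedWriter {
    public static IndexWriter open(File indexDir) throws Exception {
        IndexWriter writer = new IndexWriter(
                FSDirectory.open(indexDir),
                new StandardAnalyzer(Version.LUCENE_29),
                IndexWriter.MaxFieldLength.LIMITED);
        writer.setUseCompoundFile(false); // skip the extra .cfs packing pass
        writer.setMergeFactor(30);        // merge less often; more segments on disk
        // Flush by RAM usage instead of a fixed doc count, as suggested above:
        writer.setRAMBufferSizeMB(256.0);
        return writer;
    }
}
```

With the 10-shard plan, each of the 10 indexing threads would open its own writer like this against its own directory; a single shared index instead requires one writer shared by all threads, since only one IndexWriter may hold an index's write lock at a time.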
> >
> > Thanks and Regards,
> >
> > Shelly Singh
> > Center For Knowledge Driven Information Systems, Infosys
> > Email: shelly_si...@infosys.com
> > Phone: (M) 91 992 369 7200, (VoIP) 2022978622
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
--
Thanks and Regards,
Prashant Ullegaddi,
Search and Information Extraction Lab,
IIIT-Hyderabad, India.