Nope, getDoc is the right way to do it. Those 3 seconds are actually spent finding the proper position to read each document from, and then on IO (disk spinning, head positioning, etc).

32k documents is quite a lot. A user won't look at all these documents, at least not all at once. Maybe you could add paging: returning a page of 1000 will cut your retrieval time proportionally, to ~100 msec. If you use the result in some kind of post-processing, maybe you can rework your code to use some kind of queue, so you can start serving documents as soon as possible and the post-processing thread won't wait until all results are available.
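To make the paging idea concrete, here is a minimal sketch assuming the Lucene 3.0-era API (Searcher.search(Query, int) and IndexSearcher.doc(int)); the page size, class name, and error handling are illustrative, not from the thread:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;

    public class PagedRetrieval {

        private static final int PAGE_SIZE = 1000; // hypothetical page size

        // Fetch one page of stored documents instead of materializing all
        // 32k hits at once; only PAGE_SIZE stored-field reads hit the disk
        // per call.
        public static List<Document> fetchPage(IndexSearcher searcher,
                                               Query query,
                                               int pageNumber) throws IOException {
            // Ask for just enough hits to cover the requested page.
            TopDocs topDocs = searcher.search(query, (pageNumber + 1) * PAGE_SIZE);
            int start = pageNumber * PAGE_SIZE;
            int end = Math.min(topDocs.scoreDocs.length, start + PAGE_SIZE);

            List<Document> page = new ArrayList<Document>(PAGE_SIZE);
            for (int i = start; i < end; i++) {
                // The expensive part: the stored-field read for each hit.
                page.add(searcher.doc(topDocs.scoreDocs[i].doc));
            }
            return page;
        }
    }

For the queue variant, the same loop can push each Document into a java.util.concurrent.BlockingQueue that the post-processing thread drains, so processing starts as soon as the first document is read.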
On Mon, Aug 16, 2010 at 10:12, Shelly_Singh <shelly_si...@infosys.com> wrote:
> Hi,
>
> While I could get an excellent search time on 1 bln documents in Lucene, I am facing a problem when I try to retrieve the documents. If the number of documents returned by Lucene is large (in my example it is 32000), then the document retrieval time is 3 seconds.
>
> My Lucene document is not big; it has 3 fields of 1-2 terms each.
> From my code, I could see that most of those 3 seconds go in "reader.getDoc(docId)".
> Is there a better way to do this?
>
> Thanks and Regards,
>
> Shelly Singh
> Center For Knowledge Driven Information Systems, Infosys
> Email: shelly_si...@infosys.com
> Phone: (M) 91 992 369 7200, (VoIP) 2022978622
>
> -----Original Message-----
> From: Anshum [mailto:ansh...@gmail.com]
> Sent: Wednesday, August 11, 2010 10:38 AM
> To: java-user@lucene.apache.org
> Subject: Re: Scaling Lucene to 1bln docs
>
> So, you didn't really use the setRamBuffer..?
> Any reasons for that?
>
> --
> Anshum Gupta
> http://ai-cafe.blogspot.com
>
>
> On Wed, Aug 11, 2010 at 10:28 AM, Shelly_Singh <shelly_si...@infosys.com> wrote:
>
>> My final settings are:
>> 1. 1.5 gig RAM to the jvm, out of 2GB available on my desktop
>> 2. 100GB disk space.
>> 3. Index creation and searching tuning factors:
>>    a. mergeFactor = 10
>>    b. maxFieldLength = 10
>>    c. maxMergeDocs = 5000000
>>    d. full optimize at end of index creation
>>    e. readChunkSize = 1000000
>>    f. TermInfosIndexDivisor = 10
>>    g. NO sharding. Single machine.
>>
>> But Pablo, my document is a single-field document, with the field length being 2-5 words. So you can probably reduce it by a factor of 100 directly if you want to compare with regular docs.
>>
>> -----Original Message-----
>> From: Pablo Mendes [mailto:pablomen...@gmail.com]
>> Sent: Tuesday, August 10, 2010 7:22 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: Scaling Lucene to 1bln docs
>>
>> Shelly,
>> Do you mind sharing with the list the final settings you used for your best results?
>>
>> Cheers,
>> Pablo
>>
>> On Tue, Aug 10, 2010 at 3:49 PM, anshum.gu...@naukri.com <ansh...@gmail.com> wrote:
>>
>> > Hey Shelly,
>> > If you want to get more info on Lucene, I'd recommend you get a copy of Lucene in Action, 2nd Ed. It'll help you get the hang of a lot of things! :)
>> >
>> > --
>> > Anshum
>> > http://blog.anshumgupta.net
>> >
>> > Sent from BlackBerry®
>> >
>> > -----Original Message-----
>> > From: Shelly_Singh <shelly_si...@infosys.com>
>> > Date: Tue, 10 Aug 2010 19:11:11
>> > To: java-user@lucene.apache.org
>> > Reply-To: java-user@lucene.apache.org
>> > Subject: RE: Scaling Lucene to 1bln docs
>> >
>> > Hi folks,
>> >
>> > Thanks for the excellent support and guidance on my very first day on this mailing list...
>> > At the end of the day, I have very optimistic results: a search over 100 mln docs in less than 1 ms, and the index creation time is not huge either (close to 15 minutes).
>> >
>> > I am now hitting the 1 bln mark with roughly the same settings. But I want to understand Norms and TermFilters.
>> >
>> > Can someone explain why or why not one should use each of these, and what tradeoffs each has?
>> >
>> > Regards,
>> > Shelly
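For context on the norms question above: norms cost one byte per document, per indexed field that carries them, and the searcher holds them in memory, so at 1 bln single-field docs that is roughly 1 GB of heap; they only pay off if you rely on index-time boosts or length normalization, which matters little for short name-only fields. A minimal sketch of omitting them, assuming the Lucene 2.9/3.0-era Field API (field name and value are illustrative):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class NoNormsDoc {
        public static Document makeDoc(String name) {
            Document doc = new Document();
            // ANALYZED_NO_NORMS: tokenize the field but skip the
            // per-document norm byte (no length normalization and no
            // index-time boost for this field).
            doc.add(new Field("name", name,
                    Field.Store.YES, Field.Index.ANALYZED_NO_NORMS));
            return doc;
        }
    }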
>> > -----Original Message-----
>> > From: Danil ŢORIN [mailto:torin...@gmail.com]
>> > Sent: Tuesday, August 10, 2010 6:52 PM
>> > To: java-user@lucene.apache.org
>> > Subject: Re: Scaling Lucene to 1bln docs
>> >
>> > That won't work... if you have something like "A Basic Crazy Document E-something F-something G-something... you get the point", it will go to all shards, so the whole point of sharding will be compromised... you'll have a 26-billion-document index ;)
>> >
>> > Looks like the only way is to search all shards.
>> > Depending on available hardware (1 Azul... 50 EC2), expected traffic (1 qps... 1000 qps), expected query time (10 msec... 3 sec), redundancy (it's a large dataset, I don't think you want to lose it), and so on... you'll have to decide how many partitions you want.
>> >
>> > It may work with 8-10, it may need 50-64. (I usually use 2^n, as it's easier to split each shard in 2 when the index grows too much.)
>> >
>> > On such large datasets there is a lot of tuning and custom code, and no one-size-fits-all solution.
>> > Lucene is just a tool (a fine one), but you need to use it wisely to achieve great results.
>> >
>> > On Tue, Aug 10, 2010 at 15:55, Shelly_Singh <shelly_si...@infosys.com> wrote:
>> > > Hmm.. I get the point. But in my application, the document is basically a descriptive name of a particular thing. The user will search by name (or part of a name), and I need to pull out all info pointed to by that name. This info is externalized in a db.
>> > >
>> > > One option I can think of is:
>> > > I can shard based on the starting alphabet of any name. So, "Alan Mathur of New Delhi" may go to shard "A". But since the name will have 'n' tokens, and the user may type any one token, this will not work. I can further tweak this such that I index the same document into multiple indices (one for each token). So, the same document may be indexed into shards "A", "M", "N" and "D".
>> > > I am not able to think of another option.
>> > >
>> > > Comments welcome.
>> > >
>> > >
>> > > -----Original Message-----
>> > > From: Danil ŢORIN [mailto:torin...@gmail.com]
>> > > Sent: Tuesday, August 10, 2010 6:11 PM
>> > > To: java-user@lucene.apache.org
>> > > Subject: Re: Scaling Lucene to 1bln docs
>> > >
>> > > I'd second that.
>> > >
>> > > It doesn't have to be a date for sharding. Maybe every query has some specific field, like UserId or something, so you can redirect to a specific shard instead of hitting all 10 indices.
>> > >
>> > > You have to have some kind of narrowing: searching 1 bln documents with queries that may hit all documents is useless.
>> > > A user won't look at more than, let's say, 100 results (maybe 1000, if presented properly).
>> > >
>> > > Those fields that narrow the result set are good candidates for sharding keys.
>> > >
>> > >
>> > > On Tue, Aug 10, 2010 at 15:32, Dan OConnor <docon...@acquiremedia.com> wrote:
>> > >> Shelly:
>> > >>
>> > >> You wouldn't necessarily have to use a multisearcher. A suggested alternative is:
>> > >>
>> > >> - shard into 10 indices. If you need the concept of a date range search, I would assign the documents to the shards by date; otherwise random assignment is fine.
>> > >> - have a pool of IndexSearchers for each index.
>> > >> - when a search comes in, allocate a Searcher from each index to the search.
>> > >> - perform the search in parallel across all indices.
>> > >> - merge the results in your own code using an efficient merging algorithm (a sketch follows this message).
>> > >>
>> > >> Regards,
>> > >> Dan
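A minimal sketch of the search-all-shards approach Dan and Danil describe: one IndexSearcher per shard, searched in parallel, with the per-shard top hits merged by score. It assumes the Lucene 3.0-era Searcher.search(Query, int) API; the pool sizing and the simple sort-based merge are illustrative. Note that raw scores are only roughly comparable across shards, since each shard computes idf from its own term statistics.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;

    public class ShardedSearch {

        // One hit, remembering which shard it came from so the stored
        // document can later be fetched from the right searcher.
        public static class ShardHit {
            public final int shard;
            public final ScoreDoc scoreDoc;
            ShardHit(int shard, ScoreDoc scoreDoc) {
                this.shard = shard;
                this.scoreDoc = scoreDoc;
            }
        }

        // Search every shard in parallel and merge the per-shard top hits.
        public static List<ShardHit> search(IndexSearcher[] shards,
                                            final Query query,
                                            final int topN) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(shards.length);
            List<Future<TopDocs>> futures = new ArrayList<Future<TopDocs>>();
            for (final IndexSearcher shard : shards) {
                futures.add(pool.submit(new Callable<TopDocs>() {
                    public TopDocs call() throws Exception {
                        return shard.search(query, topN); // per-shard top N
                    }
                }));
            }

            List<ShardHit> merged = new ArrayList<ShardHit>();
            for (int s = 0; s < shards.length; s++) {
                for (ScoreDoc sd : futures.get(s).get().scoreDocs) {
                    merged.add(new ShardHit(s, sd));
                }
            }
            pool.shutdown();

            // Simple merge: sort all candidates by descending score and
            // keep the global top N.
            Collections.sort(merged, new Comparator<ShardHit>() {
                public int compare(ShardHit a, ShardHit b) {
                    return Float.compare(b.scoreDoc.score, a.scoreDoc.score);
                }
            });
            return merged.subList(0, Math.min(topN, merged.size()));
        }
    }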
>> > >>
>> > >> -----Original Message-----
>> > >> From: Shelly_Singh [mailto:shelly_si...@infosys.com]
>> > >> Sent: Tuesday, August 10, 2010 8:20 AM
>> > >> To: java-user@lucene.apache.org
>> > >> Subject: RE: Scaling Lucene to 1bln docs
>> > >>
>> > >> No sort. I will need relevance based on TF. If I shard, I will have to search in all indices.
>> > >>
>> > >> -----Original Message-----
>> > >> From: anshum.gu...@naukri.com [mailto:ansh...@gmail.com]
>> > >> Sent: Tuesday, August 10, 2010 1:54 PM
>> > >> To: java-user@lucene.apache.org
>> > >> Subject: Re: Scaling Lucene to 1bln docs
>> > >>
>> > >> Would like to know: are you using a particular type of sort? Do you need to sort on relevance? Can you shard and restrict your search to a limited set of indexes functionally?
>> > >>
>> > >> --
>> > >> Anshum
>> > >> http://blog.anshumgupta.net
>> > >>
>> > >> Sent from BlackBerry®
>> > >>
>> > >> -----Original Message-----
>> > >> From: Shelly_Singh <shelly_si...@infosys.com>
>> > >> Date: Tue, 10 Aug 2010 13:31:38
>> > >> To: java-user@lucene.apache.org
>> > >> Reply-To: java-user@lucene.apache.org
>> > >> Subject: RE: Scaling Lucene to 1bln docs
>> > >>
>> > >> Hi Anshum,
>> > >>
>> > >> I am already running with the 'setCompoundFile' option off.
>> > >> And thanks for pointing out mergeFactor. I had tried a higher mergeFactor a couple of days ago, but got an OOM, so I discarded it. Later I figured that the OOM was because maxMergeDocs was unlimited and I was using MMap. You are right, I should try a higher mergeFactor.
>> > >>
>> > >> With regards to the multithreaded approach, I was considering creating 10 different threads, each indexing 100 mln docs, coupled with a MultiSearcher to which I will feed these 10 indices. Do you think this will improve performance?
>> > >>
>> > >> And just FYI, I have the latest reading for 1 bln docs: indexing time is 2 hrs and search time is 15 secs. I can live with the indexing time, but the search time is highly unacceptable.
>> > >>
>> > >> Help again.
>> > >>
>> > >> -----Original Message-----
>> > >> From: Anshum [mailto:ansh...@gmail.com]
>> > >> Sent: Tuesday, August 10, 2010 12:55 PM
>> > >> To: java-user@lucene.apache.org
>> > >> Subject: Re: Scaling Lucene to 1bln docs
>> > >>
>> > >> Hi Shelly,
>> > >> That seems like a reasonable data set size. I'd suggest you increase your mergeFactor, as a mergeFactor of 10 means you are only buffering 10 docs in memory before writing them to a file (and incurring I/O). You could actually flush by RAM usage instead of a doc count. Turn off the compound file structure for indexing, as it generally takes more time to create a cfs index.
>> > >>
>> > >> Plus, the time would not grow linearly: the larger the segments get, the more time it takes to add more docs and merge them together intermittently.
>> > >> You may also use a multithreaded approach in case reading the source takes time in your case, though the IndexWriter would have to be shared among all threads (a sketch follows this message).
>> > >>
>> > >> --
>> > >> Anshum Gupta
>> > >> http://ai-cafe.blogspot.com
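A minimal sketch of the multithreaded setup Anshum describes: one IndexWriter shared by all feeder threads, flushing by RAM usage instead of doc count, with the compound file format turned off. It assumes the Lucene 3.0-era IndexWriter setters; the path, analyzer, buffer size, thread count, and document content are placeholders.

    import java.io.File;

    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;

    public class SharedWriterIndexing {
        public static void main(String[] args) throws Exception {
            final IndexWriter writer = new IndexWriter(
                    FSDirectory.open(new File("/tmp/index")), // placeholder path
                    new WhitespaceAnalyzer(),                 // placeholder analyzer
                    true,
                    IndexWriter.MaxFieldLength.LIMITED);

            writer.setRAMBufferSizeMB(256);   // flush by RAM usage, not doc count
            writer.setUseCompoundFile(false); // skip the extra .cfs packing step
            writer.setMergeFactor(30);        // fewer, larger merge rounds

            // IndexWriter is thread-safe: all feeder threads share one instance.
            Thread[] feeders = new Thread[10];
            for (int t = 0; t < feeders.length; t++) {
                feeders[t] = new Thread(new Runnable() {
                    public void run() {
                        try {
                            // A real feeder would loop over its slice of the source.
                            Document doc = new Document();
                            doc.add(new Field("name", "some descriptive name",
                                    Field.Store.YES, Field.Index.ANALYZED));
                            writer.addDocument(doc);
                        } catch (Exception e) {
                            throw new RuntimeException(e);
                        }
                    }
                });
                feeders[t].start();
            }
            for (Thread t : feeders) {
                t.join();
            }

            writer.optimize(); // the "full optimize at end of index creation" step
            writer.close();
        }
    }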
>> > >>
>> > >> On Tue, Aug 10, 2010 at 12:24 PM, Shelly_Singh <shelly_si...@infosys.com> wrote:
>> > >>
>> > >>> Hi,
>> > >>>
>> > >>> I am developing an application which uses Lucene for indexing and searching 1 bln documents. (The document size is very small, though. Each document has a single field of 5-10 words, so I believe that my data size is within the tested limits.)
>> > >>>
>> > >>> I am using the following configuration:
>> > >>> 1. 1.5 gig RAM to the jvm
>> > >>> 2. 100GB disk space.
>> > >>> 3. Index creation tuning factors:
>> > >>>    a. mergeFactor = 10
>> > >>>    b. maxFieldLength = 10
>> > >>>    c. maxMergeDocs = 5000000 (if I try a larger value, I get an out-of-memory)
>> > >>>
>> > >>> With these settings, I am able to create an index of 100 million docs (10 pow 8) in 15 mins, consuming 2.5gb of disk space. That is quite satisfactory for me, but nevertheless, I want to know what else can be done to tune it further. Please help.
>> > >>> Also, with these settings, can I expect the time and size to grow linearly for 1 bln (10 pow 9) documents?
>> > >>>
>> > >>> Thanks and Regards,
>> > >>>
>> > >>> Shelly Singh
>> > >>> Center For Knowledge Driven Information Systems, Infosys
>> > >>> Email: shelly_si...@infosys.com
>> > >>> Phone: (M) 91 992 369 7200, (VoIP) 2022978622
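For reference, a sketch of how the three tuning factors in this first message map onto the Lucene 3.0-era IndexWriter setters; the directory and analyzer are placeholders, and setMaxFieldLength simply truncates each field to its first N terms.

    import java.io.File;

    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;

    public class TunedWriter {
        public static IndexWriter open(File indexDir) throws Exception {
            IndexWriter writer = new IndexWriter(
                    FSDirectory.open(indexDir),
                    new WhitespaceAnalyzer(), // placeholder analyzer
                    true,
                    IndexWriter.MaxFieldLength.UNLIMITED);
            writer.setMergeFactor(10);       // a. mergeFactor = 10
            writer.setMaxFieldLength(10);    // b. keep only the first 10 terms per field
            writer.setMaxMergeDocs(5000000); // c. cap merged segment size to avoid the OOM
            return writer;
        }
    }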