Nope, getDoc is the right way to do it. Those 3 seconds are actually spent finding the proper position to read each document from, and then on IO (disk spinning, head positioning, etc).

32k documents is quite a lot. A user won't look at all these documents, at least not all at once. Maybe you could add paging: returning a page of 1000 will cut your retrieval time proportionally, to ~100 msec. If you use the result in some kind of post-processing, maybe you can rework your code to use some kind of queue, so you can start serving documents as soon as possible and the post-processing thread won't wait until all results are available.
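To make the paging idea concrete, here is a minimal sketch assuming the Lucene 3.0-era API (Searcher.search(Query, int) and IndexSearcher.doc(int)); the page size, class name, and error handling are illustrative, not from the thread:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;

    public class PagedRetrieval {

        private static final int PAGE_SIZE = 1000; // hypothetical page size

        // Fetch one page of stored documents instead of materializing all
        // 32k hits at once; only PAGE_SIZE stored-field reads hit the disk
        // per call.
        public static List<Document> fetchPage(IndexSearcher searcher,
                                               Query query,
                                               int pageNumber) throws IOException {
            // Ask for just enough hits to cover the requested page.
            TopDocs topDocs = searcher.search(query, (pageNumber + 1) * PAGE_SIZE);
            int start = pageNumber * PAGE_SIZE;
            int end = Math.min(topDocs.scoreDocs.length, start + PAGE_SIZE);

            List<Document> page = new ArrayList<Document>(PAGE_SIZE);
            for (int i = start; i < end; i++) {
                // The expensive part: the stored-field read for each hit.
                page.add(searcher.doc(topDocs.scoreDocs[i].doc));
            }
            return page;
        }
    }

For the queue variant, the same loop can push each Document into a java.util.concurrent.BlockingQueue that the post-processing thread drains, so processing starts as soon as the first document is read.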
On Mon, Aug 16, 2010 at 10:12, Shelly_Singh <shelly_si...@infosys.com> wrote:
> Hi,
>
> While I could get an excellent search time on 1 bln documents in Lucene, I am facing a problem when I try to retrieve the documents. If the number of documents returned by Lucene is large (in my example it is 32000), then the document retrieval time is 3 seconds.
>
> My Lucene document is not big; it has 3 fields of 1-2 terms each.
> From my code, I could see that most of those 3 seconds go in "reader.getDoc(docId)".
> Is there a better way to do this?
>
> Thanks and Regards,
>
> Shelly Singh
> Center For Knowledge Driven Information Systems, Infosys
> Email: shelly_si...@infosys.com
> Phone: (M) 91 992 369 7200, (VoIP) 2022978622
>
> -----Original Message-----
> From: Anshum [mailto:ansh...@gmail.com]
> Sent: Wednesday, August 11, 2010 10:38 AM
> To: java-user@lucene.apache.org
> Subject: Re: Scaling Lucene to 1bln docs
>
> So, you didn't really use the setRamBuffer..?
> Any reasons for that?
>
> --
> Anshum Gupta
> http://ai-cafe.blogspot.com
>
>
> On Wed, Aug 11, 2010 at 10:28 AM, Shelly_Singh <shelly_si...@infosys.com> wrote:
>
>> My final settings are:
>> 1. 1.5 gig RAM to the jvm, out of 2GB available on my desktop
>> 2. 100GB disk space.
>> 3. Index creation and searching tuning factors:
>>    a. mergeFactor = 10
>>    b. maxFieldLength = 10
>>    c. maxMergeDocs = 5000000
>>    d. full optimize at end of index creation
>>    e. readChunkSize = 1000000
>>    f. TermInfosIndexDivisor = 10
>>    g. NO sharding. Single machine.
>>
>> But Pablo, my document is a single-field document, with the field length being 2-5 words. So you can probably reduce it by a factor of 100 directly if you want to compare with regular docs.
>>
>> -----Original Message-----
>> From: Pablo Mendes [mailto:pablomen...@gmail.com]
>> Sent: Tuesday, August 10, 2010 7:22 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: Scaling Lucene to 1bln docs
>>
>> Shelly,
>> Do you mind sharing with the list the final settings you used for your best results?
>>
>> Cheers,
>> Pablo
>>
>> On Tue, Aug 10, 2010 at 3:49 PM, anshum.gu...@naukri.com <ansh...@gmail.com> wrote:
>>
>> > Hey Shelly,
>> > If you want to get more info on Lucene, I'd recommend you get a copy of Lucene in Action, 2nd Ed. It'll help you get the hang of a lot of things! :)
>> >
>> > --
>> > Anshum
>> > http://blog.anshumgupta.net
>> >
>> > Sent from BlackBerry®
>> >
>> > -----Original Message-----
>> > From: Shelly_Singh <shelly_si...@infosys.com>
>> > Date: Tue, 10 Aug 2010 19:11:11
>> > To: java-user@lucene.apache.org
>> > Reply-To: java-user@lucene.apache.org
>> > Subject: RE: Scaling Lucene to 1bln docs
>> >
>> > Hi folks,
>> >
>> > Thanks for the excellent support and guidance on my very first day on this mailing list...
>> > At the end of the day, I have very optimistic results: a search over 100 mln docs in less than 1 ms, and the index creation time is not huge either (close to 15 minutes).
>> >
>> > I am now hitting the 1 bln mark with roughly the same settings. But I want to understand Norms and TermFilters.
>> >
>> > Can someone explain why or why not one should use each of these, and what tradeoffs each has?
>> >
>> > Regards,
>> > Shelly
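For context on the norms question above: norms cost one byte per document, per indexed field that carries them, and the searcher holds them in memory, so at 1 bln single-field docs that is roughly 1 GB of heap; they only pay off if you rely on index-time boosts or length normalization, which matters little for short name-only fields. A minimal sketch of omitting them, assuming the Lucene 2.9/3.0-era Field API (field name and value are illustrative):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class NoNormsDoc {
        public static Document makeDoc(String name) {
            Document doc = new Document();
            // ANALYZED_NO_NORMS: tokenize the field but skip the
            // per-document norm byte (no length normalization and no
            // index-time boost for this field).
            doc.add(new Field("name", name,
                    Field.Store.YES, Field.Index.ANALYZED_NO_NORMS));
            return doc;
        }
    }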
>> > -----Original Message-----
>> > From: Danil ŢORIN [mailto:torin...@gmail.com]
>> > Sent: Tuesday, August 10, 2010 6:52 PM
>> > To: java-user@lucene.apache.org
>> > Subject: Re: Scaling Lucene to 1bln docs
>> >
>> > That won't work... if you have something like "A Basic Crazy Document E-something F-something G-something... you get the point", it will go to all shards, so the whole point of sharding will be compromised... you'll have a 26-billion-document index ;)
>> >
>> > Looks like the only way is to search all shards.
>> > Depending on available hardware (1 Azul... 50 EC2), expected traffic (1 qps... 1000 qps), expected query time (10 msec... 3 sec), redundancy (it's a large dataset, I don't think you want to lose it), and so on... you'll have to decide how many partitions you want.
>> >
>> > It may work with 8-10, it may need 50-64. (I usually use 2^n, as it's easier to split each shard in 2 when the index grows too much.)
>> >
>> > On such large datasets there is a lot of tuning and custom code, and no one-size-fits-all solution.
>> > Lucene is just a tool (a fine one), but you need to use it wisely to achieve great results.
>> >
>> > On Tue, Aug 10, 2010 at 15:55, Shelly_Singh <shelly_si...@infosys.com> wrote:
>> > > Hmm.. I get the point. But in my application, the document is basically a descriptive name of a particular thing. The user will search by name (or part of a name), and I need to pull out all info pointed to by that name. This info is externalized in a db.
>> > >
>> > > One option I can think of is:
>> > > I can shard based on the starting alphabet of any name. So, "Alan Mathur of New Delhi" may go to shard "A". But since the name will have 'n' tokens, and the user may type any one token, this will not work. I can further tweak this such that I index the same document into multiple indices (one for each token). So, the same document may be indexed into shards "A", "M", "N" and "D".
>> > > I am not able to think of another option.
>> > >
>> > > Comments welcome.
>> > >
>> > >
>> > > -----Original Message-----
>> > > From: Danil ŢORIN [mailto:torin...@gmail.com]
>> > > Sent: Tuesday, August 10, 2010 6:11 PM
>> > > To: java-user@lucene.apache.org
>> > > Subject: Re: Scaling Lucene to 1bln docs
>> > >
>> > > I'd second that.
>> > >
>> > > It doesn't have to be a date for sharding. Maybe every query has some specific field, like UserId or something, so you can redirect to a specific shard instead of hitting all 10 indices.
>> > >
>> > > You have to have some kind of narrowing: searching 1 bln documents with queries that may hit all documents is useless.
>> > > A user won't look at more than, let's say, 100 results (maybe 1000, if presented properly).
>> > >
>> > > Those fields that narrow the result set are good candidates for sharding keys.
>> > >
>> > >
>> > > On Tue, Aug 10, 2010 at 15:32, Dan OConnor <docon...@acquiremedia.com> wrote:
>> > >> Shelly:
>> > >>
>> > >> You wouldn't necessarily have to use a multisearcher. A suggested alternative is:
>> > >>
>> > >> - shard into 10 indices. If you need the concept of a date range search, I would assign the documents to the shards by date; otherwise random assignment is fine.
>> > >> - have a pool of IndexSearchers for each index.
>> > >> - when a search comes in, allocate a Searcher from each index to the search.
>> > >> - perform the search in parallel across all indices.
>> > >> - merge the results in your own code using an efficient merging algorithm (a sketch follows this message).
>> > >>
>> > >> Regards,
>> > >> Dan
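A minimal sketch of the search-all-shards approach Dan and Danil describe: one IndexSearcher per shard, searched in parallel, with the per-shard top hits merged by score. It assumes the Lucene 3.0-era Searcher.search(Query, int) API; the pool sizing and the simple sort-based merge are illustrative. Note that raw scores are only roughly comparable across shards, since each shard computes idf from its own term statistics.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;

    public class ShardedSearch {

        // One hit, remembering which shard it came from so the stored
        // document can later be fetched from the right searcher.
        public static class ShardHit {
            public final int shard;
            public final ScoreDoc scoreDoc;
            ShardHit(int shard, ScoreDoc scoreDoc) {
                this.shard = shard;
                this.scoreDoc = scoreDoc;
            }
        }

        // Search every shard in parallel and merge the per-shard top hits.
        public static List<ShardHit> search(IndexSearcher[] shards,
                                            final Query query,
                                            final int topN) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(shards.length);
            List<Future<TopDocs>> futures = new ArrayList<Future<TopDocs>>();
            for (final IndexSearcher shard : shards) {
                futures.add(pool.submit(new Callable<TopDocs>() {
                    public TopDocs call() throws Exception {
                        return shard.search(query, topN); // per-shard top N
                    }
                }));
            }

            List<ShardHit> merged = new ArrayList<ShardHit>();
            for (int s = 0; s < shards.length; s++) {
                for (ScoreDoc sd : futures.get(s).get().scoreDocs) {
                    merged.add(new ShardHit(s, sd));
                }
            }
            pool.shutdown();

            // Simple merge: sort all candidates by descending score and
            // keep the global top N.
            Collections.sort(merged, new Comparator<ShardHit>() {
                public int compare(ShardHit a, ShardHit b) {
                    return Float.compare(b.scoreDoc.score, a.scoreDoc.score);
                }
            });
            return merged.subList(0, Math.min(topN, merged.size()));
        }
    }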
>> > >>
>> > >> -----Original Message-----
>> > >> From: Shelly_Singh [mailto:shelly_si...@infosys.com]
>> > >> Sent: Tuesday, August 10, 2010 8:20 AM
>> > >> To: java-user@lucene.apache.org
>> > >> Subject: RE: Scaling Lucene to 1bln docs
>> > >>
>> > >> No sort. I will need relevance based on TF. If I shard, I will have to search in all indices.
>> > >>
>> > >> -----Original Message-----
>> > >> From: anshum.gu...@naukri.com [mailto:ansh...@gmail.com]
>> > >> Sent: Tuesday, August 10, 2010 1:54 PM
>> > >> To: java-user@lucene.apache.org
>> > >> Subject: Re: Scaling Lucene to 1bln docs
>> > >>
>> > >> Would like to know: are you using a particular type of sort? Do you need to sort on relevance? Can you shard and restrict your search to a limited set of indexes functionally?
>> > >>
>> > >> --
>> > >> Anshum
>> > >> http://blog.anshumgupta.net
>> > >>
>> > >> Sent from BlackBerry®
>> > >>
>> > >> -----Original Message-----
>> > >> From: Shelly_Singh <shelly_si...@infosys.com>
>> > >> Date: Tue, 10 Aug 2010 13:31:38
>> > >> To: java-user@lucene.apache.org
>> > >> Reply-To: java-user@lucene.apache.org
>> > >> Subject: RE: Scaling Lucene to 1bln docs
>> > >>
>> > >> Hi Anshum,
>> > >>
>> > >> I am already running with the 'setCompoundFile' option off.
>> > >> And thanks for pointing out mergeFactor. I had tried a higher mergeFactor a couple of days ago, but got an OOM, so I discarded it. Later I figured that the OOM was because maxMergeDocs was unlimited and I was using MMap. You are right, I should try a higher mergeFactor.
>> > >>
>> > >> With regards to the multithreaded approach, I was considering creating 10 different threads, each indexing 100 mln docs, coupled with a MultiSearcher to which I will feed these 10 indices. Do you think this will improve performance?
>> > >>
>> > >> And just FYI, I have the latest reading for 1 bln docs: indexing time is 2 hrs and search time is 15 secs. I can live with the indexing time, but the search time is highly unacceptable.
>> > >>
>> > >> Help again.
>> > >>
>> > >> -----Original Message-----
>> > >> From: Anshum [mailto:ansh...@gmail.com]
>> > >> Sent: Tuesday, August 10, 2010 12:55 PM
>> > >> To: java-user@lucene.apache.org
>> > >> Subject: Re: Scaling Lucene to 1bln docs
>> > >>
>> > >> Hi Shelly,
>> > >> That seems like a reasonable data set size. I'd suggest you increase your mergeFactor, as a mergeFactor of 10 means you are only buffering 10 docs in memory before writing them to a file (and incurring I/O). You could actually flush by RAM usage instead of a doc count. Turn off the compound file structure for indexing, as it generally takes more time to create a cfs index.
>> > >>
>> > >> Plus, the time would not grow linearly: the larger the segments get, the more time it takes to add more docs and merge them together intermittently.
>> > >> You may also use a multithreaded approach in case reading the source takes time in your case, though the IndexWriter would have to be shared among all threads (a sketch follows this message).
>> > >>
>> > >> --
>> > >> Anshum Gupta
>> > >> http://ai-cafe.blogspot.com
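A minimal sketch of the multithreaded setup Anshum describes: one IndexWriter shared by all feeder threads, flushing by RAM usage instead of doc count, with the compound file format turned off. It assumes the Lucene 3.0-era IndexWriter setters; the path, analyzer, buffer size, thread count, and document content are placeholders.

    import java.io.File;

    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;

    public class SharedWriterIndexing {
        public static void main(String[] args) throws Exception {
            final IndexWriter writer = new IndexWriter(
                    FSDirectory.open(new File("/tmp/index")), // placeholder path
                    new WhitespaceAnalyzer(),                 // placeholder analyzer
                    true,
                    IndexWriter.MaxFieldLength.LIMITED);

            writer.setRAMBufferSizeMB(256);   // flush by RAM usage, not doc count
            writer.setUseCompoundFile(false); // skip the extra .cfs packing step
            writer.setMergeFactor(30);        // fewer, larger merge rounds

            // IndexWriter is thread-safe: all feeder threads share one instance.
            Thread[] feeders = new Thread[10];
            for (int t = 0; t < feeders.length; t++) {
                feeders[t] = new Thread(new Runnable() {
                    public void run() {
                        try {
                            // A real feeder would loop over its slice of the source.
                            Document doc = new Document();
                            doc.add(new Field("name", "some descriptive name",
                                    Field.Store.YES, Field.Index.ANALYZED));
                            writer.addDocument(doc);
                        } catch (Exception e) {
                            throw new RuntimeException(e);
                        }
                    }
                });
                feeders[t].start();
            }
            for (Thread t : feeders) {
                t.join();
            }

            writer.optimize(); // the "full optimize at end of index creation" step
            writer.close();
        }
    }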
>> > >>
>> > >> On Tue, Aug 10, 2010 at 12:24 PM, Shelly_Singh <shelly_si...@infosys.com> wrote:
>> > >>
>> > >>> Hi,
>> > >>>
>> > >>> I am developing an application which uses Lucene for indexing and searching 1 bln documents. (The document size is very small, though. Each document has a single field of 5-10 words, so I believe that my data size is within the tested limits.)
>> > >>>
>> > >>> I am using the following configuration:
>> > >>> 1. 1.5 gig RAM to the jvm
>> > >>> 2. 100GB disk space.
>> > >>> 3. Index creation tuning factors:
>> > >>>    a. mergeFactor = 10
>> > >>>    b. maxFieldLength = 10
>> > >>>    c. maxMergeDocs = 5000000 (if I try a larger value, I get an out-of-memory)
>> > >>>
>> > >>> With these settings, I am able to create an index of 100 million docs (10 pow 8) in 15 mins, consuming 2.5gb of disk space. That is quite satisfactory for me, but nevertheless, I want to know what else can be done to tune it further. Please help.
>> > >>> Also, with these settings, can I expect the time and size to grow linearly for 1 bln (10 pow 9) documents?
>> > >>>
>> > >>> Thanks and Regards,
>> > >>>
>> > >>> Shelly Singh
>> > >>> Center For Knowledge Driven Information Systems, Infosys
>> > >>> Email: shelly_si...@infosys.com
>> > >>> Phone: (M) 91 992 369 7200, (VoIP) 2022978622
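For reference, a sketch of how the three tuning factors in this first message map onto the Lucene 3.0-era IndexWriter setters; the directory and analyzer are placeholders, and setMaxFieldLength simply truncates each field to its first N terms.

    import java.io.File;

    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;

    public class TunedWriter {
        public static IndexWriter open(File indexDir) throws Exception {
            IndexWriter writer = new IndexWriter(
                    FSDirectory.open(indexDir),
                    new WhitespaceAnalyzer(), // placeholder analyzer
                    true,
                    IndexWriter.MaxFieldLength.UNLIMITED);
            writer.setMergeFactor(10);       // a. mergeFactor = 10
            writer.setMaxFieldLength(10);    // b. keep only the first 10 terms per field
            writer.setMaxMergeDocs(5000000); // c. cap merged segment size to avoid the OOM
            return writer;
        }
    }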