Hi All,

Some very promising findings: for 100 mln documents (a factor of 10 less than my goal), I could bring the search speed down to single-digit milliseconds. The major change is that I am now optimizing the index, which I was shying away from doing earlier.
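In a 3.x-era Lucene, that optimize step is a single call on the writer; a minimal sketch, assuming the index already exists on disk (the path is illustrative):

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class OptimizeIndex {
        public static void main(String[] args) throws Exception {
            Directory dir = FSDirectory.open(new File("/data/index-100m")); // hypothetical path
            IndexWriter writer = new IndexWriter(dir,
                    new StandardAnalyzer(Version.LUCENE_30),
                    false, // open the existing index for append, do not recreate it
                    IndexWriter.MaxFieldLength.LIMITED);
            writer.optimize(); // merge down to a single segment; fewer segments means faster term lookups
            writer.close();
            dir.close();
        }
    }

Optimizing rewrites the whole index, so it is expensive; it pays off for an index that is built once and then searched heavily, which matches this use case.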
For fun, I am planning to take a reading of 1 bln without sharding, and then look at sharding. Dan! I got your point about why the first letter of tokens is not a good key for deciding shards, but as I foresee it, the number of words per document will generally be between 2 and 5. So, it may still be worth a try.

-----Original Message-----
From: Danil ŢORIN [mailto:torin...@gmail.com]
Sent: Tuesday, August 10, 2010 6:52 PM
To: java-user@lucene.apache.org
Subject: Re: Scaling Lucene to 1bln docs

That won't work... if you have something like "A Basic Crazy Document E-something F-something G-something... you get the point", it will go to all shards, so the whole point of sharding will be compromised... you'll have a 26-billion-document index ;)

Looks like the only way is to search all shards.

Depending on available hardware (1 Azul... 50 EC2), expected traffic (1 qps... 1000 qps), expected query time (10 msec... 3 sec), redundancy (it's a large dataset, I don't think you want to lose it), and so on, you'll have to decide how many partitions you want. It may work with 8-10, it may need 50-64. (I usually use 2^n, as it's easier to split each shard in 2 when the index grows too much.)

On such large datasets there is a lot of tuning and custom code, and no one-size-fits-all solution. Lucene is just a tool (a fine one), but you need to use it wisely to achieve great results.

On Tue, Aug 10, 2010 at 15:55, Shelly_Singh <shelly_si...@infosys.com> wrote:
> Hmm.. I get the point. But in my application, the document is basically a
> descriptive name of a particular thing. The user will search by name (or part
> of a name) and I need to pull out all the info pointed to by that name. This
> info is externalized in a db.
>
> One option I can think of is:
> I can shard based on the starting letter of any name. So, "Alan Mathur of New
> Delhi" may go to shard "A". But since the name will have 'n' tokens, and the
> user may type any one token, this will not work. I can further tweak this
> such that I index the same document into multiple indices (one for each
> token). So, the same document may be indexed into shards "A", "M", "N" and "D".
> I am not able to think of another option.
>
> Comments welcome.
>
>
> -----Original Message-----
> From: Danil ŢORIN [mailto:torin...@gmail.com]
> Sent: Tuesday, August 10, 2010 6:11 PM
> To: java-user@lucene.apache.org
> Subject: Re: Scaling Lucene to 1bln docs
>
> I'd second that.
>
> It doesn't have to be a date for sharding. Maybe every query has some
> specific field, like a UserId or something, so you can redirect to a
> specific shard instead of hitting all 10 indices.
>
> You have to have some kind of narrowing: searching 1 bln documents with
> queries that may hit all documents is useless.
> A user won't look at more than, let's say, 100 results (if presented
> properly, maybe 1000).
>
> Those fields that narrow the result set are good candidates for sharding keys.
>
>
> On Tue, Aug 10, 2010 at 15:32, Dan OConnor <docon...@acquiremedia.com> wrote:
>> Shelly:
>>
>> You wouldn't necessarily have to use a multisearcher. A suggested
>> alternative is:
>>
>> - shard into 10 indices. If you need the concept of a date-range search, I
>>   would assign the documents to the shards by date; otherwise random
>>   assignment is fine.
>> - have a pool of IndexSearchers for each index
>> - when a search comes in, allocate a searcher from each index to the search
>> - perform the search in parallel across all indices
>> - merge the results in your own code using an efficient merging algorithm.
>>
>> Regards,
>> Dan
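A rough sketch of the fan-out Dan describes, assuming 3.x-era APIs (the shard paths, pool size, and merge depth are illustrative, not from the thread):

    import java.io.File;
    import java.util.*;
    import java.util.concurrent.*;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;

    public class ShardedSearch {
        private final List<IndexSearcher> shards = new ArrayList<IndexSearcher>();
        private final ExecutorService pool;

        public ShardedSearch(String[] shardPaths) throws Exception {
            for (String p : shardPaths) {
                shards.add(new IndexSearcher(
                        IndexReader.open(FSDirectory.open(new File(p)), true)));
            }
            pool = Executors.newFixedThreadPool(shards.size());
        }

        // Search every shard in parallel, then merge the per-shard top hits by score.
        public List<ScoreDoc> search(final Query q, final int topN) throws Exception {
            List<Future<TopDocs>> futures = new ArrayList<Future<TopDocs>>();
            for (final IndexSearcher s : shards) {
                futures.add(pool.submit(new Callable<TopDocs>() {
                    public TopDocs call() throws Exception {
                        return s.search(q, topN); // each shard returns its own top N
                    }
                }));
            }
            // Min-heap capped at topN, so only the global best hits survive.
            PriorityQueue<ScoreDoc> heap = new PriorityQueue<ScoreDoc>(topN,
                    new Comparator<ScoreDoc>() {
                        public int compare(ScoreDoc a, ScoreDoc b) {
                            return Float.compare(a.score, b.score);
                        }
                    });
            for (Future<TopDocs> f : futures) {
                for (ScoreDoc sd : f.get().scoreDocs) {
                    heap.offer(sd);
                    if (heap.size() > topN) heap.poll(); // evict the current worst
                }
            }
            List<ScoreDoc> merged = new ArrayList<ScoreDoc>(heap);
            Collections.sort(merged, new Comparator<ScoreDoc>() {
                public int compare(ScoreDoc a, ScoreDoc b) {
                    return Float.compare(b.score, a.score); // best first
                }
            });
            return merged; // note: doc ids are shard-local, so real code should track the shard index too
        }
    }

One caveat the recipe leaves out: each shard scores against its own term statistics, so the merged ranking is only an approximation of what a single index would produce unless document frequencies are shared across shards.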
>> -----Original Message-----
>> From: Shelly_Singh [mailto:shelly_si...@infosys.com]
>> Sent: Tuesday, August 10, 2010 8:20 AM
>> To: java-user@lucene.apache.org
>> Subject: RE: Scaling Lucene to 1bln docs
>>
>> No sort. I will need relevance based on TF. If I shard, I will have to
>> search in all indices.
>>
>> -----Original Message-----
>> From: anshum.gu...@naukri.com [mailto:ansh...@gmail.com]
>> Sent: Tuesday, August 10, 2010 1:54 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: Scaling Lucene to 1bln docs
>>
>> Would like to know, are you using a particular type of sort? Do you need to
>> sort on relevance? Can you shard and restrict your search to a limited set
>> of indexes functionally?
>>
>> --
>> Anshum
>> http://blog.anshumgupta.net
>>
>> Sent from BlackBerry(r)
>>
>> -----Original Message-----
>> From: Shelly_Singh <shelly_si...@infosys.com>
>> Date: Tue, 10 Aug 2010 13:31:38
>> To: java-user@lucene.apache.org
>> Reply-To: java-user@lucene.apache.org
>> Subject: RE: Scaling Lucene to 1bln docs
>>
>> Hi Anshum,
>>
>> I am already running with the 'setCompoundFile' option off.
>> And thanks for pointing out mergeFactor. I had tried a higher mergeFactor a
>> couple of days ago, but got an OOM, so I discarded it. Later I figured out
>> that the OOM was because maxMergeDocs was unlimited and I was using MMap.
>> You are right, I should try a higher mergeFactor.
>>
>> With regards to the multithreaded approach, I was considering creating 10
>> different threads, each indexing 100 mln docs, coupled with a MultiSearcher
>> to which I will feed these 10 indices. Do you think this will improve
>> performance?
>>
>> And just FYI, I have the latest reading for 1 bln docs. Indexing time is
>> 2 hrs and search time is 15 secs. I can live with the indexing time, but
>> the search time is highly unacceptable.
>>
>> Help again.
>>
>> -----Original Message-----
>> From: Anshum [mailto:ansh...@gmail.com]
>> Sent: Tuesday, August 10, 2010 12:55 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: Scaling Lucene to 1bln docs
>>
>> Hi Shelly,
>> That seems like a reasonable data set size. I'd suggest you increase your
>> mergeFactor, as a mergeFactor of 10 means you are only buffering 10 docs in
>> memory before writing them to a file (and incurring I/O). You could
>> actually flush by RAM usage instead of a doc count. Turn off using the
>> compound file structure for indexing, as it generally takes more time
>> creating a cfs index.
>>
>> Plus, the time would not grow linearly: the larger the segments get, the
>> more time it takes to add more docs and merge them together intermittently.
>> You may also use a multithreaded approach in case reading the source takes
>> time in your case, though the IndexWriter would have to be shared among all
>> threads.
>>
>> --
>> Anshum Gupta
>> http://ai-cafe.blogspot.com
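A rough illustration of the tuning Anshum suggests, in 3.x-era setters (the buffer size, merge factor, thread count, and field values are placeholders, not recommendations from the thread):

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class TunedIndexer {
        public static void main(String[] args) throws Exception {
            final IndexWriter writer = new IndexWriter(
                    FSDirectory.open(new File("/data/index")), // hypothetical path
                    new StandardAnalyzer(Version.LUCENE_30), true,
                    IndexWriter.MaxFieldLength.LIMITED);
            writer.setUseCompoundFile(false); // skip building .cfs files on flush/merge
            writer.setRAMBufferSizeMB(256);   // flush by RAM usage, not by doc count
            writer.setMergeFactor(30);        // merge less often; more segments live at once

            // IndexWriter is thread-safe, so several feeder threads can share one instance.
            Thread[] feeders = new Thread[4];
            for (int i = 0; i < feeders.length; i++) {
                feeders[i] = new Thread(new Runnable() {
                    public void run() {
                        Document doc = new Document();
                        doc.add(new Field("name", "some descriptive name",
                                Field.Store.NO, Field.Index.ANALYZED));
                        try {
                            writer.addDocument(doc);
                        } catch (Exception e) {
                            // real code would log and decide whether to abort
                        }
                    }
                });
                feeders[i].start();
            }
            for (Thread t : feeders) t.join();
            writer.close();
        }
    }

The trade-off behind the RAM buffer setting: a larger buffer means fewer flushes and fewer small segments to merge later, at the cost of heap, which matters here given the 1.5 GB JVM mentioned below.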
>>
>> On Tue, Aug 10, 2010 at 12:24 PM, Shelly_Singh <shelly_si...@infosys.com> wrote:
>>
>>> Hi,
>>>
>>> I am developing an application which uses Lucene for indexing and
>>> searching 1 bln documents. (The document size is very small, though: each
>>> document has a single field of 5-10 words, so I believe my data size is
>>> within the tested limits.)
>>>
>>> I am using the following configuration:
>>> 1. 1.5 gig RAM for the JVM
>>> 2. 100GB disk space
>>> 3. Index-creation tuning factors:
>>>    a. mergeFactor = 10
>>>    b. maxFieldLength = 10
>>>    c. maxMergeDocs = 5000000 (if I try a larger value, I get an
>>>       out-of-memory)
>>>
>>> With these settings, I am able to create an index of 100 million docs
>>> (10^8) in 15 mins, consuming 2.5 GB of disk space. That is quite
>>> satisfactory for me, but nevertheless, I want to know what else can be
>>> done to tune it further. Please help.
>>> Also, with these settings, can I expect the time and size to grow
>>> linearly for 1 bln (10^9) documents?
>>>
>>> Thanks and Regards,
>>>
>>> Shelly Singh
>>> Center For Knowledge Driven Information Systems, Infosys
>>> Email: shelly_si...@infosys.com
>>> Phone: (M) 91 992 369 7200, (VoIP) 2022978622
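For reference, a sketch of that baseline configuration in 3.x-era calls (the path and analyzer are assumptions; the three numeric settings are the ones listed in the message):

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class BaselineIndexer {
        public static void main(String[] args) throws Exception {
            Directory dir = FSDirectory.open(new File("/data/index-1bln")); // hypothetical path
            IndexWriter writer = new IndexWriter(dir,
                    new StandardAnalyzer(Version.LUCENE_30), true,
                    new IndexWriter.MaxFieldLength(10)); // b. maxFieldLength = 10
            writer.setMergeFactor(10);        // a. mergeFactor = 10 (the default)
            writer.setMaxMergeDocs(5000000);  // c. cap merged segment size; avoids the OOM described above
            // ... loop calling writer.addDocument(...) for the 5-10 word documents ...
            writer.close();
        }
    }

With documents of only 5-10 words, maxFieldLength = 10 truncates almost nothing, which is consistent with the small on-disk index size reported for 100 million docs.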