RE: Scaling Lucene to 1bln docs

2010-08-10 Thread Shelly_Singh
I experimented with it, but somehow (I am not sure why) I got poorer indexing performance with higher RAM. That was an initial experiment and I did not dig into it. But for the time being I have acceptable indexing speed, so I am only focusing on reducing search time. Thanks and Regards, Shelly

Re: Scaling Lucene to 1bln docs

2010-08-10 Thread Anshum
So, you didn't really use setRAMBufferSizeMB? Any reason for that? -- Anshum Gupta http://ai-cafe.blogspot.com On Wed, Aug 11, 2010 at 10:28 AM, Shelly_Singh wrote: > My final settings are: > 1. 1.5 GB RAM to the JVM, out of the 2 GB available on my desktop > 2. 100 GB disk space > 3.

RE: Scaling Lucene to 1bln docs

2010-08-10 Thread Shelly_Singh
My final settings are:
1. 1.5 GB RAM to the JVM, out of the 2 GB available on my desktop
2. 100 GB disk space
3. Index creation and search tuning factors:
   a. mergeFactor = 10
   b. maxFieldLength = 10
   c. maxMergeDocs = 500
   d. full optimize at end of indexing
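
For readers following the thread, here is roughly how those settings map onto the 2.9/3.0-era IndexWriter API. A minimal sketch only; the index path is hypothetical, and some of the digit strings above look truncated in the digest:

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class TuningSketch {
        public static void main(String[] args) throws Exception {
            FSDirectory dir = FSDirectory.open(new File("/path/to/index")); // hypothetical path
            IndexWriter writer = new IndexWriter(dir,
                    new StandardAnalyzer(Version.LUCENE_29),
                    true,                                   // true = create a fresh index
                    new IndexWriter.MaxFieldLength(10));    // b. maxFieldLength
            writer.setMergeFactor(10);    // a. segments merged 10 at a time
            writer.setMaxMergeDocs(500);  // c. cap on docs in any merged segment
            // ... addDocument() calls go here ...
            writer.optimize();            // d. full optimize at the end
            writer.close();
        }
    }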

read past EOF

2010-08-10 Thread Ganesh
Hello all, I am getting the following exception for one of my customers. I think the index is corrupted, but I want to know the exact cause. Exception: read past EOF at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:135)
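
One way to confirm whether the index itself is corrupt, rather than guessing, is Lucene's bundled CheckIndex tool. A minimal sketch against the 2.9-era API:

    import java.io.File;
    import org.apache.lucene.index.CheckIndex;
    import org.apache.lucene.store.FSDirectory;

    public class DiagnoseIndex {
        public static void main(String[] args) throws Exception {
            // args[0]: path to the suspect index
            FSDirectory dir = FSDirectory.open(new File(args[0]));
            CheckIndex checker = new CheckIndex(dir);
            CheckIndex.Status status = checker.checkIndex(); // walks every segment
            System.out.println(status.clean ? "index is clean" : "index is corrupt");
            // checker.fixIndex(status) would drop the broken segments,
            // losing the documents they contain -- use with care.
        }
    }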

Partial matching with spaces

2010-08-10 Thread L Duperval
I am using Compass, which uses Lucene 2.4.1. We are trying to implement "find as you type" searching. So for example, if I have documents containing the following: Newfoundland, New Mexico, New York, New Yorkshire. If a user types "New", all four documents are returned. When the user adds a space, only
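
A common workaround (not spelled out in the thread, so treat it as a sketch with made-up field names): index a second, untokenized, lowercased copy of the name and run a PrefixQuery against it, so the space survives analysis:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.PrefixQuery;
    import org.apache.lucene.search.Query;

    public class FindAsYouType {
        // Add an untokenized copy of the name alongside the analyzed one.
        static Document makeDoc(String name) {
            Document doc = new Document();
            doc.add(new Field("name", name, Field.Store.YES, Field.Index.ANALYZED));
            // Lowercase here and at query time, since NOT_ANALYZED skips the analyzer.
            doc.add(new Field("name_exact", name.toLowerCase(),
                              Field.Store.NO, Field.Index.NOT_ANALYZED));
            return doc;
        }

        // "new y" now matches New York / New Yorkshire but not New Mexico.
        static Query prefix(String typedSoFar) {
            return new PrefixQuery(new Term("name_exact", typedSoFar.toLowerCase()));
        }
    }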

RE: Index merge question

2010-08-10 Thread IKoelliker
Would it matter if an IndexReader was opened while an index merge is in progress? Thanks -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Tuesday, August 10, 2010 12:03 PM To: java-user@lucene.apache.org Subject: Re: Index merge question When you open an In

Re: Index merge question

2010-08-10 Thread Erick Erickson
When you open an IndexReader, Lucene effectively takes a "snapshot" of the index and searches it until you reopen your reader. The timing of when the merged index gets used is therefore up to you, so you should be fine. Best, Erick On Tue, Aug 10, 2010 at 11:28 AM, wrote: > Hello, > > Is there any point
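
A minimal sketch of the reopen pattern Erick describes, against the 2.x/3.0 API, where IndexReader.reopen() returns the same instance when nothing has changed:

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;

    public class ReaderRefresh {
        /** Swap in the post-merge snapshot; a cheap no-op if nothing changed. */
        static IndexReader refresh(IndexReader reader) throws IOException {
            IndexReader current = reader.reopen(); // same instance if index unchanged
            if (current != reader) {
                reader.close(); // release the pre-merge snapshot
            }
            return current;     // searches through this reader see the merged index
        }
    }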

Index merge question

2010-08-10 Thread IKoelliker
Hello, Is there any point during a merge operation where the index cannot be searched or is unstable? We want to create a bunch of smaller indexes in parallel and then merge them into a single index that may have searches running against it. Thanks, Ian Koelliker

Re: Scaling Lucene to 1bln docs

2010-08-10 Thread Pablo Mendes
Shelly, Do you mind sharing with the list the final settings you used for your best results? Cheers, Pablo On Tue, Aug 10, 2010 at 3:49 PM, anshum.gu...@naukri.com wrote: > Hey Shelly, > If you want to get more info on lucene, I'd recommend you get a copy of > lucene in action 2nd Ed. It'll help

Re: Scaling Lucene to 1bln docs

2010-08-10 Thread anshum.gu...@naukri.com
Hey Shelly, If you want to get more info on Lucene, I'd recommend you get a copy of Lucene in Action, 2nd Ed. It'll help you get the hang of a lot of things! :) -- Anshum http://blog.anshumgupta.net Sent from BlackBerry® -Original Message- From: Shelly_Singh Date: Tue, 10 Aug 2010 19:11:1

RE: Scaling Lucene to 1bln docs

2010-08-10 Thread Shelly_Singh
Hi folks, Thanks for the excellent support and guidance on my very first day on this mailing list... At the end of the day, I have very optimistic results: search over 100 mln docs in less than 1 ms, and index creation time is not huge either (close to 15 minutes). I am now hitting the 1bln mark with roughly the

RE: Scaling Lucene to 1bln docs

2010-08-10 Thread Shelly_Singh
Hi All, Some very promising findings... for 100 mln (a factor of 10 less than my goal), I could bring the search speed down to single-digit milliseconds. The major change is that I am now optimizing the index, which I was shying away from doing earlier. For fun, I am planning to take a reading of 1 bln

Re: Scaling Lucene to 1bln docs

2010-08-10 Thread Danil ŢORIN
That won't work... if you have something like "A Basic Crazy Document E-something F-something G-something... you get the point", it will go to all shards, so the whole point of sharding will be compromised... you'll have a 26-billion-document index ;) Looks like the only way is to search all shards. Danil

RE: Scaling Lucene to 1bln docs

2010-08-10 Thread Shelly_Singh
Hmm... I get the point. But in my application, the document is basically a descriptive name of a particular thing. The user will search by name (or part of a name) and I need to pull out all the info pointed to by that name. This info is externalized in a DB. One option I can think of is: I can shard

Re: Scaling Lucene to 1bln docs

2010-08-10 Thread prashant ullegaddi
You might want to take a look at RemoteSearchable ( http://lucene.apache.org/java/2_9_2/api/contrib-remote/org/apache/lucene/search/RemoteSearchable.html) -- it'll be helpful if you place shards on different servers. On Tue, Aug 10, 2010 at 6:08 PM, Shelly_Singh wrote: > - shard into 10 indices.
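
A rough sketch of that setup against the 2.9 contrib-remote API; the host names and the registry port are made up, and error handling is omitted:

    import java.io.File;
    import java.rmi.Naming;
    import java.rmi.registry.LocateRegistry;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.MultiSearcher;
    import org.apache.lucene.search.RemoteSearchable;
    import org.apache.lucene.search.Searchable;
    import org.apache.lucene.store.FSDirectory;

    public class RemoteShards {
        // On each shard server: export the local searcher over RMI.
        static void serve(String indexPath) throws Exception {
            IndexSearcher local =
                new IndexSearcher(FSDirectory.open(new File(indexPath)), true);
            LocateRegistry.createRegistry(1099);           // default RMI port
            Naming.rebind("//localhost/shard", new RemoteSearchable(local));
        }

        // On the query node: combine the remote shards into one searcher.
        static MultiSearcher connect() throws Exception {
            Searchable s0 = (Searchable) Naming.lookup("//shard-host-0/shard");
            Searchable s1 = (Searchable) Naming.lookup("//shard-host-1/shard");
            return new MultiSearcher(new Searchable[] { s0, s1 });
        }
    }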

Re: Scaling Lucene to 1bln docs

2010-08-10 Thread Danil ŢORIN
I'd second that. It doesn't have to be a date for sharding. Maybe every query has some specific field, like UserId or something, so you can redirect to a specific shard instead of hitting all 10 indices. You have to have some kind of narrowing: searching 1bn documents with queries that may hit all documents
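
A sketch of the kind of routing Danil is suggesting, assuming a UserId-like narrowing field and 10 shards (both assumptions, not from the thread). The same function routes a document at index time and a query at search time, so each query touches exactly one index:

    public class ShardRouter {
        private static final int NUM_SHARDS = 10;

        /** Deterministic shard choice from the narrowing field. */
        static int shardFor(String userId) {
            // Mask the sign bit so Integer.MIN_VALUE can't yield a negative index.
            return (userId.hashCode() & 0x7fffffff) % NUM_SHARDS;
        }
    }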

RE: Scaling Lucene to 1bln docs

2010-08-10 Thread Shelly_Singh
> - shard into 10 indices. If you need the concept of a date range search, I would assign the documents to the shard by date, otherwise random assignment is fine.
Shelly - Actually my documents are originally database records, with each being equally important.
> - have a pool of IndexSearchers for

RE: Scaling Lucene to 1bln docs

2010-08-10 Thread Shelly_Singh
I do not see a way to optimally decide how to shard the data. It's very difficult for my purpose, and so the safe bet is to assume that all indices will need to be searched. Okay, I can try ParallelMultiSearcher in addition to MultiSearcher. -Original Message- From: Anshum [mailto:ansh.

RE: Scaling Lucene to 1bln docs

2010-08-10 Thread Dan OConnor
Shelly: You wouldn't necessarily have to use a MultiSearcher. A suggested alternative is:
- shard into 10 indices. If you need the concept of a date range search, I would assign the documents to the shard by date, otherwise random assignment is fine.
- have a pool of IndexSearchers for each index

Re: Scaling Lucene to 1bln docs

2010-08-10 Thread Anshum
Searching on all indices shouldn't be that bad an idea compared to searching a single huge index, especially considering you have a constraint on the usable memory. You could use a ParallelMultiSearcher, which spawns threads to query across indexes and merges the results. What I asked was: is there a
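
A minimal sketch of the ParallelMultiSearcher approach on the 2.9/3.0 API; the shard directory layout is hypothetical:

    import java.io.File;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ParallelMultiSearcher;
    import org.apache.lucene.search.Searchable;
    import org.apache.lucene.store.FSDirectory;

    public class ShardedSearch {
        public static void main(String[] args) throws Exception {
            Searchable[] shards = new Searchable[10];
            for (int i = 0; i < shards.length; i++) {
                // hypothetical layout: shard-0 .. shard-9 under one data directory
                shards[i] = new IndexSearcher(
                        FSDirectory.open(new File("/data/shard-" + i)), true);
            }
            // Queries each shard in its own thread, then merges the hits.
            ParallelMultiSearcher searcher = new ParallelMultiSearcher(shards);
            // searcher.search(query, 10) ...
        }
    }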

RE: Scaling Lucene to 1bln docs

2010-08-10 Thread Shelly_Singh
No sort. I will need relevance based on TF. If I shard, I will have to search in all indices. -Original Message- From: anshum.gu...@naukri.com [mailto:ansh...@gmail.com] Sent: Tuesday, August 10, 2010 1:54 PM To: java-user@lucene.apache.org Subject: Re: Scaling Lucene to 1bln docs Would

Re: Scaling Lucene to 1bln docs

2010-08-10 Thread findbestopensource
Hi Shelly, You need to reduce your maxMergeDocs. Set ramBufferSizeMB to 100, which will help you use less RAM in indexing. >>> search time is 15 secs.. How are you calculating this time? Are you just taking the time difference before and after the search method, or does this involve the time to parse the document or
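
A sketch of the distinction being asked about: timing the search call separately from stored-field retrieval, which is often the slower part (class and variable names are made up):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;

    public class SearchTimer {
        static void time(IndexSearcher searcher, Query query) throws Exception {
            long t0 = System.nanoTime();
            TopDocs hits = searcher.search(query, 10);      // the search itself
            long searchMs = (System.nanoTime() - t0) / 1000000L;

            long t1 = System.nanoTime();
            for (ScoreDoc sd : hits.scoreDocs) {
                Document d = searcher.doc(sd.doc);          // stored-field loading
            }
            long fetchMs = (System.nanoTime() - t1) / 1000000L;
            System.out.println("search=" + searchMs + "ms, fetch=" + fetchMs + "ms");
        }
    }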

Re: Scaling Lucene to 1bln docs

2010-08-10 Thread Michael McCandless
Correction: mergeFactor determines how many segments are merged at once. It's IndexWriter's ramBufferSizeMB and/or maxBufferedDocs that determine how many docs are buffered in RAM before a new segment is flushed. A higher mergeFactor will require more RAM during merging, will cause longer running
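
The two knobs being distinguished, as a fragment against the pre-4.0 IndexWriter (the values are illustrative, not recommendations):

    import org.apache.lucene.index.IndexWriter;

    public class FlushKnobs {
        static void configure(IndexWriter writer) {
            // Flushing: a new segment is written once buffered docs use ~64 MB...
            writer.setRAMBufferSizeMB(64.0);
            // ...or, alternatively, after a fixed number of buffered docs:
            // writer.setMaxBufferedDocs(10000);

            // Merging only: how many segments are merged at once.
            writer.setMergeFactor(10);
        }
    }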

Re: Scaling Lucene to 1bln docs

2010-08-10 Thread anshum.gu...@naukri.com
I'd like to know: are you using a particular type of sort? Do you need to sort on relevance? Can you shard and restrict your search to a limited set of indexes functionally? -- Anshum http://blog.anshumgupta.net Sent from BlackBerry® -Original Message- From: Shelly_Singh Date: Tue,

RE: Scaling Lucene to 1bln docs

2010-08-10 Thread Shelly_Singh
Hi Danil, I get your point. In fact, the latest readings I have for 1bln docs assert the same thing. Index creation time is 2 hours, which is fine by me, but search time is 15 secs, which is too high for any application. I am planning to do a sharding of indices and then use a multi

RE: Scaling Lucene to 1bln docs

2010-08-10 Thread Shelly_Singh
Hi Anshum, I am already running with the 'setCompoundFile' option off. And thanks for pointing out mergeFactor. I had tried a higher mergeFactor a couple of days ago but got an OOM, so I discarded it. Later I figured the OOM was because maxMergeDocs was unlimited and I was using MMap. You are right,
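
A sketch of the two settings Shelly mentions, on the 2.9/3.0-era IndexWriter; the maxMergeDocs cap shown is hypothetical:

    import org.apache.lucene.index.IndexWriter;

    public class MergeGuards {
        static void configure(IndexWriter writer) {
            // Skip the compound-file format: less copying at index time.
            writer.setUseCompoundFile(false);
            // With MMapDirectory, an unbounded merge can produce one huge segment;
            // a cap (value hypothetical) bounds the largest merge.
            writer.setMaxMergeDocs(10000000);
        }
    }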

Re: Scaling Lucene to 1bln docs

2010-08-10 Thread Danil ŢORIN
The problem actually won't be the indexing part. Searching such a large dataset will require a LOT of memory. If you need sorting or faceting on one of the fields, the JVM will explode ;) Also, GC times on a large JVM heap are pretty disturbing (if you care about your search performance). So I'd advise

Re: Scaling Lucene to 1bln docs

2010-08-10 Thread Anshum
Hi Shelly, That seems like a reasonable data set size. I'd suggest you increase your mergeFactor, as a mergeFactor of 10 says you are only buffering 10 docs in memory before writing to a file (and incurring I/O). You could actually flush by RAM usage instead of a doc count. Turn off using the Co