Re: Can you use reduced sized test indexes to predict performance gains for a larger index?

2010-02-15 Thread Peter Keegan
Same experience here as Tom. Disk I/O becomes a bottleneck with large indexes (or multiple shards per server) when less memory is available. Frequent index updates can make the I/O bottleneck worse. Peter On Mon, Feb 15, 2010 at 2:17 PM, Tom Burton-West wrote: > Hi Chris, > In our experience with larg

Re: Controlling what is indexed / normalizing our index

2010-02-15 Thread Ahmet Arslan
> We have a list of keywords with aliases (Example: keyword = "ms access", aliases = "microsoft access", "msaccess", "m.s. access"). We would like to intercept the aliases prior to them being indexed, and have the keyword indexed instead. We can do this with a CustomFilter for s

Controlling what is indexed / normalizing our index

2010-02-15 Thread maxSchlein
We have a list of keywords with aliases (Example: keyword = "ms access" aliases = "microsoft access", "msaccess", "m.s. access" ) We would like to intercept the aliases prior to them being indexed, and have the keyword indexed instead. We can do this with a CustomFilter for single word aliases
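One way to handle multi-word aliases is to match token sequences rather than single tokens. The following is a minimal sketch in plain Java, not a Lucene TokenFilter: the class AliasNormalizer, its methods, and the naive whitespace tokenizer are all hypothetical, and it only illustrates longest-match replacement using the aliases from the example above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper (not part of Lucene): replaces alias token
// sequences with the tokens of their canonical keyword.
class AliasNormalizer {
    private final Map<List<String>, List<String>> aliases = new HashMap<>();
    private int maxAliasLength = 0;

    // Register an alias phrase and the keyword that should be indexed instead.
    void addAlias(String alias, String keyword) {
        List<String> aliasTokens = tokenize(alias);
        aliases.put(aliasTokens, tokenize(keyword));
        maxAliasLength = Math.max(maxAliasLength, aliasTokens.size());
    }

    // Assumption: lowercase + whitespace split stands in for a real analyzer.
    static List<String> tokenize(String text) {
        String trimmed = text.trim().toLowerCase(Locale.ROOT);
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
    }

    // At each position, try the longest candidate window first so that
    // multi-word aliases win over shorter ones.
    List<String> normalize(List<String> tokens) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            boolean replaced = false;
            int longest = Math.min(maxAliasLength, tokens.size() - i);
            for (int len = longest; len >= 1; len--) {
                List<String> keyword = aliases.get(tokens.subList(i, i + len));
                if (keyword != null) {
                    out.addAll(keyword);
                    i += len;
                    replaced = true;
                    break;
                }
            }
            if (!replaced) {
                out.add(tokens.get(i));
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        AliasNormalizer n = new AliasNormalizer();
        n.addAlias("microsoft access", "ms access");
        n.addAlias("msaccess", "ms access");
        n.addAlias("m.s. access", "ms access");
        System.out.println(n.normalize(tokenize("We use Microsoft Access and m.s. access daily")));
    }
}
```

In a real index the same idea would have to live inside the analysis chain and buffer tokens while it looks ahead; this sketch ignores positions and offsets entirely.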

Re: Can you use reduced sized test indexes to predict performance gains for a larger index?

2010-02-15 Thread Tom Burton-West
Hi Chris, In our experience with large indexes (about 200-300GB), we found most of our bottlenecks involved disk I/O. We found that if our experimental indexes were too small, much of the index could fit in cache, so our test results were not applicable to our larger indexes. On the

PayloadNearSpanScorer explain method

2010-02-15 Thread Peter Keegan
The 'explain' method in PayloadNearSpanScorer assumes the AveragePayloadFunction was used. I don't see an easy way to override this because 'payloadsSeen' and 'payloadScore' are private/protected. It seems like the 'PayloadFunction' interface should have an 'explain' method that the Scorer could ca

Re: question regarding BooleanQuery:equals() method

2010-02-15 Thread Smith G
Hello All, I am really sorry for not following the rules and bringing it to the top. It is important at the moment. Thanks. On 11 February 2010 15:51, Smith G wrote: > Hello All, > I am writing some test cases for a custom-class which modifies incoming TermQuery and

Re: Strange Fuzzyquery results scoring when using a low minimal distance

2010-02-15 Thread mark harwood
This could be down to IDF, i.e. "Lucane" is ranked higher because it is rarer, despite having a worse edit distance. This is arguably a bug. See http://issues.apache.org/jira/browse/LUCENE-329 which discusses this. You could try subclassing QueryParser and overriding newFuzzyQuery to return FuzzyLikeThisQu

Strange Fuzzyquery results scoring when using a low minimal distance

2010-02-15 Thread stefcl
Hello, I'm using Lucene v3. Please consider the following spellings: Lucene, Lucéne, lucéne, Lucane, Lucen. When searching for "lucéne" among those words using a FuzzyQuery (with 0.5 edit distance), results show: 1. Lucene 1.0259752 2. Lucane 1.0259752 3. Lucéne 0.95660806 4. lucéne 0.95660806 5.
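To see how the listed spellings compare by edit distance alone, here is a minimal sketch in plain Java. It does not call Lucene. The class FuzzySimilaritySketch is hypothetical, the comparison is case-sensitive and per character (whether an analyzer lowercased the terms in the original setup is not known), and the similarity formula 1 - distance / min(length) is an assumption about how the fuzzy similarity is normalized when no prefix is used.

```java
// Hypothetical sketch (not Lucene code): Levenshtein distance and a
// length-normalized similarity for the spellings in the question.
class FuzzySimilaritySketch {

    // Classic two-row dynamic-programming Levenshtein distance.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumed normalization: 1 - distance / length of the shorter string.
    static float similarity(String query, String term) {
        int shorter = Math.min(query.length(), term.length());
        if (shorter == 0) {
            return query.length() == term.length() ? 1.0f : 0.0f;
        }
        return 1.0f - (float) editDistance(query, term) / shorter;
    }

    public static void main(String[] args) {
        String query = "luc\u00e9ne"; // "lucéne", escaped to avoid source-encoding issues
        String[] terms = {"Lucene", "Luc\u00e9ne", "luc\u00e9ne", "Lucane", "Lucen"};
        for (String term : terms) {
            System.out.println(term + " distance=" + editDistance(query, term)
                    + " similarity=" + similarity(query, term));
        }
    }
}
```

Under these assumptions "Lucene" and "Lucane" are at the same distance from the query, which is consistent with their tied scores, but edit distance alone does not explain why they outrank the exact match.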

Re: Further refinement of search results - distinguishing hits with exact phrase match from the rest

2010-02-15 Thread mark harwood
Re Mike's delegating custom query suggestion - see https://issues.apache.org/jira/browse/LUCENE-1999 ----- Original Message ----- From: Michael McCandless To: java-user@lucene.apache.org Sent: Mon, 15 February, 2010 10:03:30 Subject: Re: Further refinement of search results - distinguishing hi

Re: Further refinement of search results - distinguishing hits with exact phrase match from the rest

2010-02-15 Thread Michael McCandless
I don't think Lucene makes this easy, today, out of the box. The scoring process for a boolean query doesn't track which sub-clause matched. Though, it does track the number of clauses that matched (coord). E.g., you'd be able to tell that a given hit had both clauses match, vs only 1 (just not
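The coord point can be illustrated with a small sketch in plain Java. It assumes the default definition of coord as the number of matching clauses divided by the total number of clauses; the class CoordSketch and its methods are hypothetical and do not call Lucene.

```java
// Hypothetical sketch (not Lucene code): coord as the fraction of
// query clauses that matched a given hit.
class CoordSketch {

    // Assumed default definition: overlap / maxOverlap.
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    // Recovers how many clauses matched from a coord value.
    static int matchedClauses(float coord, int maxOverlap) {
        return Math.round(coord * maxOverlap);
    }

    public static void main(String[] args) {
        int totalClauses = 2;
        // Hit A matched only the first clause, hit B only the second:
        // both yield the same coord, so the factor cannot tell them apart.
        float hitA = coord(1, totalClauses);
        float hitB = coord(1, totalClauses);
        float both = coord(2, totalClauses);
        System.out.println("hitA=" + hitA + " hitB=" + hitB + " both=" + both);
        System.out.println("clauses matched for hitA: " + matchedClauses(hitA, totalClauses));
    }
}
```

The count of matching clauses is recoverable from coord, but two hits that each matched a different single clause get the same value, which is why identifying the matching clause needs something beyond the score.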