Re: RangeFilter performance problem using MultiReader

2009-04-12 Thread Yonik Seeley
Hmmm, something is wrong: range queries over many terms should definitely be faster. There are some other oddities in your results... - the "consolidated index" shows as slower (295ms vs 602ms)... but patch 1596 doesn't touch that code path (a single segment index). - TEST2 (using searcher.sear

Re: RangeFilter performance problem using MultiReader

2009-04-12 Thread Raf
I am sorry, but after applying this patch, the performance in my tests is worse than on lucene-2.9-dev trunk. TEST1: using filter.getDocIdSet(reader); Test results (Num docs = 2,940,738) using lucene-core-2.9-dev trunk: 1 Original index (12 collections * 6 months = 72 indexes)

Re: RangeFilter performance problem using MultiReader

2009-04-11 Thread Yonik Seeley
OK, I think this will improve the situation: https://issues.apache.org/jira/browse/LUCENE-1596 -Yonik http://www.lucidimagination.com On Fri, Apr 10, 2009 at 1:47 PM, Michael McCandless wrote: > We never fully explained it, but we have some ideas... > > It's only if you iterate each term, and d

Re: RangeFilter performance problem using MultiReader

2009-04-11 Thread Erick Erickson

RE: RangeFilter performance problem using MultiReader

2009-04-11 Thread Uwe Schindler

Re: RangeFilter performance problem using MultiReader

2009-04-11 Thread Erick Erickson

RE: RangeFilter performance problem using MultiReader

2009-04-11 Thread Uwe Schindler

Re: RangeFilter performance problem using MultiReader

2009-04-11 Thread Michael McCandless
Ahhh, OK, perhaps that explains the sizable perf difference you're seeing w/ optimized vs not. I'm curious to see the results of your "merge each month into 1 index" test... Mike On Sat, Apr 11, 2009 at 9:21 AM, Roberto Franchini wrote: > On Sat, Apr 11, 2009 at 1:50 PM, Michael McCandless > w

Re: RangeFilter performance problem using MultiReader

2009-04-11 Thread Roberto Franchini
On Sat, Apr 11, 2009 at 1:50 PM, Michael McCandless wrote: > Hmm then I'm a bit baffled again. > > Because, each of your "by month" indexes presumably has a unique > subset of terms for the "date_doc" field?  Meaning, a given "by month" > index will have all date_doc corresponding to that month, a

Re: RangeFilter performance problem using MultiReader

2009-04-11 Thread Michael McCandless
Hmm then I'm a bit baffled again. Because, each of your "by month" indexes presumably has a unique subset of terms for the "date_doc" field? Meaning, a given "by month" index will have all date_doc corresponding to that month, and a different "by month" index would presumably have no overlap in t

Re: RangeFilter performance problem using MultiReader

2009-04-11 Thread Roberto Franchini
On Sat, Apr 11, 2009 at 11:48 AM, Michael McCandless wrote: > On Sat, Apr 11, 2009 at 5:27 AM, Raf wrote: > [cut] > > You have readers from 72 different directories, but is each directory > an optimized or unoptimized index? Hi, I'm Raffaella's colleague, and I'm the "indexer" while she is the "s

Re: RangeFilter performance problem using MultiReader

2009-04-11 Thread Michael McCandless
On Sat, Apr 11, 2009 at 5:27 AM, Raf wrote: > I have repeated my tests using a searcher and now the performance on 2.9 is > much better than on 2.4.1, especially when the filter extracts a lot > of docs. OK, phew! > However the same search on the consolidated index is even faster This i

Re: RangeFilter performance problem using MultiReader

2009-04-11 Thread Raf
> dex. This is not faster in 2.9. > > To compare speed, please use real search code (Searcher.search())! > > Uwe

RE: RangeFilter performance problem using MultiReader

2009-04-11 Thread Uwe Schindler
> Thanks Uwe, > I had already read about TrieRangeFilter on this mailing list and I thought > it could be useful to solve my problem. > I think I will trie it for test purposes.

RE: RangeFilter performance problem using MultiReader

2009-04-11 Thread Uwe Schindler

Re: RangeFilter performance problem using MultiReader

2009-04-11 Thread Raf
Thanks Uwe, I had already read about TrieRangeFilter on this mailing list and I thought it could be useful to solve my problem. I think I will trie it for test purposes. Unfortunately, I have now to solve the problem in a production system and I would like to avoid using a yet unreleased version.
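A rough illustration of the idea behind trie-style range filtering, for readers who have not seen it: if values are also indexed at coarser precisions, a wide range can be covered by a few coarse terms plus some fine ones at the edges. The sketch below is not the TrieRangeFilter implementation or its API; `PrefixCover`, its `cover` method, and the 4-bit step are hypothetical, and the code only counts how many prefix blocks are needed to cover a range of non-negative longs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy class: illustrates prefix-block covering, not Lucene code.
final class PrefixCover {

    /**
     * Greedily covers [lo, hi] with aligned blocks of size 2^shift, where shift
     * is a multiple of stepBits. Each entry is {prefix, shift}; the block spans
     * (prefix << shift) .. (prefix << shift) + 2^shift - 1.
     * Assumes 0 <= lo <= hi and a small maxBits (e.g. 32) so nothing overflows.
     */
    public static List<long[]> cover(long lo, long hi, int stepBits, int maxBits) {
        List<long[]> blocks = new ArrayList<>();
        long cur = lo;
        while (cur <= hi) {
            int shift = 0;
            // Grow the block while it stays aligned at cur and inside the range.
            while (shift + stepBits <= maxBits) {
                long size = 1L << (shift + stepBits);
                boolean aligned = (cur & (size - 1)) == 0;
                boolean fits = cur + size - 1 <= hi;
                if (!aligned || !fits) {
                    break;
                }
                shift += stepBits;
            }
            blocks.add(new long[] {cur >>> shift, shift});
            cur += 1L << shift;
        }
        return blocks;
    }

    public static void main(String[] args) {
        List<long[]> blocks = cover(3, 1000, 4, 32);
        System.out.println("distinct values in range: " + (1000 - 3 + 1));
        System.out.println("prefix blocks needed:     " + blocks.size());
    }
}
```

For the range 3..1000 with a 4-bit step, this covers 998 distinct values with 53 blocks, which is the kind of reduction in visited terms that makes very large ranges cheaper.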

Re: RangeFilter performance problem using MultiReader

2009-04-11 Thread Raf
No, it is a MultiReader that contains 72 (I am sorry, I wrote a wrong number last time) "single" readers. Raf On Fri, Apr 10, 2009 at 9:14 PM, Mark Miller wrote: > Raf wrote: > >> >> We have more or less 3M documents in 24 indexes and we read all of them >> using a MultiReader. >> >> > > Is this

Re: RangeFilter performance problem using MultiReader

2009-04-11 Thread Raf
Ok, here you can find some details about my tests:
MultiReader creation:
    IndexReader subReader;
    List subReaders = new ArrayList();
    for (Directory dir : this.directories) {
        try {
            subReader = IndexReader.open(dir, true);
            subReaders.add(subReader);
        } catch (...) {

RE: RangeFilter performance problem using MultiReader

2009-04-10 Thread Uwe Schindler
You got a lot of answers and questions about your index structure. Now another idea, maybe this helps you to speed up your RangeFilter: What type of range do you want to query? From your index statistics, it looks like a numeric/date field from which you filter very large ranges. If the values are

Re: RangeFilter performance problem using MultiReader

2009-04-10 Thread Michael McCandless
On Fri, Apr 10, 2009 at 3:06 PM, Mark Miller wrote: > 24 segments is bound to be quite a bit slower than an optimized index for > most things I'd be curious just how true this really is (in general)... my guess is the "long tail of tiny segments" gets into the OS's IO cache (as long as the syste

Re: RangeFilter performance problem using MultiReader

2009-04-10 Thread Michael McCandless
On Fri, Apr 10, 2009 at 3:11 PM, Mark Miller wrote: > Mark Miller wrote: >> >> Michael McCandless wrote: >>> >>> which is why I'm baffled that Raf didn't see a speedup on >>> upgrading. >>> >>> Mike >>> >> >> Another point is that he may not have such a nasty set of segments - Raf >> says he has 2

Re: RangeFilter performance problem using MultiReader

2009-04-10 Thread Michael McCandless
On Fri, Apr 10, 2009 at 3:14 PM, Mark Miller wrote: > Raf wrote: >> >> We have more or less 3M documents in 24 indexes and we read all of them >> using a MultiReader. >> > > Is this a multireader containing multireaders? Let's hear Raf's answer, but I think likely "yes". But this shouldn't be a

Re: RangeFilter performance problem using MultiReader

2009-04-10 Thread Mark Miller
Raf wrote: > We have more or less 3M documents in 24 indexes and we read all of them > using a MultiReader. Is this a multireader containing multireaders? - Mark http://www.lucidimagination.com

Re: RangeFilter performance problem using MultiReader

2009-04-10 Thread Mark Miller
Mark Miller wrote: Michael McCandless wrote: which is why I'm baffled that Raf didn't see a speedup on upgrading. Mike Another point is that he may not have such a nasty set of segments - Raf says he has 24 indexes, which sounds like he may not have the logarithmic sizing you normally see

Re: RangeFilter performance problem using MultiReader

2009-04-10 Thread Mark Miller
Michael McCandless wrote: which is why I'm baffled that Raf didn't see a speedup on upgrading. Mike Another point is that he may not have such a nasty set of segments - Raf says he has 24 indexes, which sounds like he may not have the logarithmic sizing you normally see. If you have somewh

Re: RangeFilter performance problem using MultiReader

2009-04-10 Thread Mark Miller
Michael McCandless wrote: On Fri, Apr 10, 2009 at 2:32 PM, Mark Miller wrote: I had thought we would also see the advantage with multi-term queries - you rewrite against each segment and avoid extra seeks (though not nearly as many as when enumerating every term). As Mike pointed out to me back when though: we st

Re: RangeFilter performance problem using MultiReader

2009-04-10 Thread Michael McCandless
On Fri, Apr 10, 2009 at 2:32 PM, Mark Miller wrote: > I had thought we would also see the advantage with multi-term queries - you > rewrite against each segment and avoid extra seeks (though not nearly as > many as when enumerating every term). As Mike pointed out to me back when > though: we st

Re: RangeFilter performance problem using MultiReader

2009-04-10 Thread Mark Miller
When I did some profiling I saw that the slowdown came from tons of extra seeks (single segment vs multisegment). What was happening was, the first couple segments would have thousands of terms for the field, but as the segments logarithmically shrank in size, the number of terms for the segme
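For anyone reading the archive later, the seek blow-up described here can be illustrated with a toy model. This is a sketch under assumptions, not Lucene code: each sub-index is modelled as a sorted set of date terms, a "seek" is simply counted once per term lookup in a sub-index, and the class name `SeekCountModel`, its method names, and the 2008 dates are all made up. It compares looking up every term of the merged enumeration in every sub-index against letting each sub-index enumerate only its own terms.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

// Hypothetical toy model of term lookups; it does not use or mirror Lucene classes.
final class SeekCountModel {

    /** Lookups when every term of the merged range is sought in every sub-index. */
    public static long seeksViaMergedView(List<TreeSet<String>> segments,
                                          String lower, String upper) {
        TreeSet<String> merged = new TreeSet<>();
        for (TreeSet<String> segment : segments) {
            merged.addAll(segment.subSet(lower, true, upper, true));
        }
        return (long) merged.size() * segments.size();
    }

    /** Lookups when each sub-index only visits the range terms it actually holds. */
    public static long seeksPerSegment(List<TreeSet<String>> segments,
                                       String lower, String upper) {
        long seeks = 0;
        for (TreeSet<String> segment : segments) {
            seeks += segment.subSet(lower, true, upper, true).size();
        }
        return seeks;
    }

    /** Builds one sub-index per month, each with its own non-overlapping day terms. */
    public static List<TreeSet<String>> monthlySegments(int months, int daysPerMonth) {
        List<TreeSet<String>> segments = new ArrayList<>();
        for (int m = 1; m <= months; m++) {
            TreeSet<String> terms = new TreeSet<>();
            for (int d = 1; d <= daysPerMonth; d++) {
                // Made-up yyyyMMdd terms, e.g. "20080315".
                terms.add(String.format(Locale.ROOT, "2008%02d%02d", m, d));
            }
            segments.add(terms);
        }
        return segments;
    }

    public static void main(String[] args) {
        List<TreeSet<String>> segments = monthlySegments(6, 28);
        System.out.println("merged view: "
                + seeksViaMergedView(segments, "20080101", "20080628"));
        System.out.println("per segment: "
                + seeksPerSegment(segments, "20080101", "20080628"));
    }
}
```

With six monthly sub-indexes of 28 day-terms each, the merged view counts 1008 lookups and the per-segment view 168, so in this model the gap grows in proportion to the number of sub-indexes.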

Re: RangeFilter performance problem using MultiReader

2009-04-10 Thread Michael McCandless
On Fri, Apr 10, 2009 at 1:20 PM, Raf wrote: > Hi Mike, > thank you for your answer. > > I have downloaded lucene-core-2.9-dev and I have executed my tests (both on > multireader and on consolidated index) using this new version, but the > performance is very similar to the previous one. > The bi

Re: RangeFilter performance problem using MultiReader

2009-04-10 Thread Michael McCandless
On Fri, Apr 10, 2009 at 11:03 AM, Yonik Seeley wrote: > On Fri, Apr 10, 2009 at 10:48 AM, Michael McCandless > wrote: >> Unfortunately, in Lucene 2.4, any query that needs to enumerate Terms >> (Prefix, Wildcard, Range, etc.) has poor performance on Multi*Readers. > > Do we know why this is, and

Re: RangeFilter performance problem using MultiReader

2009-04-10 Thread Raf
Hi Mike, thank you for your answer. I have downloaded lucene-core-2.9-dev and I have executed my tests (both on multireader and on consolidated index) using this new version, but the performance is very similar to the previous one. The big index is 7 to 8 times faster than the multireader version. Raf

Re: RangeFilter performance problem using MultiReader

2009-04-10 Thread Yonik Seeley
On Fri, Apr 10, 2009 at 10:48 AM, Michael McCandless wrote: > Unfortunately, in Lucene 2.4, any query that needs to enumerate Terms > (Prefix, Wildcard, Range, etc.) has poor performance on Multi*Readers. Do we know why this is, and if it's fixable (the MultiTermEnum, not the higher level query o

Re: RangeFilter performance problem using MultiReader

2009-04-10 Thread Michael McCandless
Unfortunately, in Lucene 2.4, any query that needs to enumerate Terms (Prefix, Wildcard, Range, etc.) has poor performance on Multi*Readers. I think the only workaround is to merge your indexes down to a single index. But, Lucene trunk (not yet released) has fixed this, so that searching through