Oh my goodness yes. No wonder nothing I suggested made any difference <G>. Ignore everything I've written....
OK, here's something to try, and it goes back to a Filter. Rather than
building that enormous bunch of ORs, try creating a Filter. Use TermDocs
to run through your list of IDs, assembling a Filter (the one backed by a
BitSet), and send *that* through with your query rather than all those
clauses. Running through the terms is much faster than I thought.

The only wrinkle is that TermDocs hands documents back in doc ID order,
not in your order, so if you walk the terms that way you'd have to use,
say, a Map to check whether each unique ID you hit is one you actually
want.... Or who knows? Just calling TermDocs.seek(Term) for each ID on
your list might do the trick.... Term.createTerm() will help a bit here,
since it reuses the interned field name....

You can get a feel for the cost by just spinning through your list of
unique IDs with TermDocs.seek(Term) and timing how long assembling the
filter takes. Actually using the Filter in a query costs very little....
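Something like this sketch is what I have in mind. It's completely
untested, written against the 2.4 API, and the "uniqueId" field name,
the ids collection, and the class name are just stand-ins for your
external-key field and your list of values:

import java.io.IOException;
import java.util.BitSet;
import java.util.Collection;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.search.Filter;

// Builds the BitSet straight from TermDocs, one seek per unique ID,
// instead of OR-ing 50K TermQuery clauses into a BooleanQuery.
public class ExternalIdFilter extends Filter {

    private final Collection<String> ids; // your 10K-50K external keys
    // "uniqueId" stands in for whatever your key field is really called
    private final Term template = new Term("uniqueId", "");

    public ExternalIdFilter(Collection<String> ids) {
        this.ids = ids;
    }

    // The BitSet-flavored Filter method -- deprecated in favor of
    // getDocIdSet(), but still the simplest route in 2.4.
    public BitSet bits(IndexReader reader) throws IOException {
        BitSet result = new BitSet(reader.maxDoc());
        TermDocs termDocs = reader.termDocs();
        try {
            for (String id : ids) {
                // createTerm() reuses the interned field name -- the
                // small savings I mentioned above
                termDocs.seek(template.createTerm(id));
                while (termDocs.next()) {
                    result.set(termDocs.doc());
                }
            }
        } finally {
            termDocs.close();
        }
        return result;
    }
}

Then the search is just searcher.search(query, new ExternalIdFilter(ids),
collector), and you can time the bits() call on its own to see where the
cost lands.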
Best
Erick

On Mon, Nov 23, 2009 at 8:12 AM, Eran Sevi <erans...@gmail.com> wrote:

> Erick,
>
> Maybe I didn't make myself clear enough. I'm talking about high-level
> filters used when searching.
>
> I construct a very big BooleanQuery and add 50K clauses to it (I removed
> the limit on max clauses). Each clause is a TermQuery on the same field.
> I don't know the internal doc IDs that I want; I only know the value
> stored in the field. This value is an external key of the document and
> is guaranteed to be unique.
>
> I pass this query to QueryFilter and run a search using
> search(query, filter, collector) (reminder: I'm using ver 2.4).
>
> If I missed some other magic construction method that achieves the same
> goal, it would be great to hear about it.
>
> Thanks for your help so far,
> Eran.
>
> On Mon, Nov 23, 2009 at 2:55 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>
> > Now I'm really confused, which usually means I'm making some
> > assumptions that aren't true. So here they are...
> >
> > 1> You're talking about Filters that contain BitSets, right? Not some
> > other kind of filter.
> > 2> When you create your 10-50K filters, you wind up with a single
> > filter by combining them all with one of the BitSet operators, right?
> > And that *single* filter is the one you send to your query...
> >
> > If I'm off base here, could you post a reasonable extract of your
> > filter construction code and of how you use it to search? Because I
> > don't think we're all talking about the same thing here.....
> >
> > HTH
> > er...@thismakesnosensetome <G>...
> >
> > On Mon, Nov 23, 2009 at 5:18 AM, Eran Sevi <erans...@gmail.com> wrote:
> >
> > > After commenting out the collector logic, the time is still more or
> > > less the same. Anyway, since collecting the documents without the
> > > filter is very fast, it's probably something in the filter itself.
> > >
> > > I don't know how the filter (or boolean query) works internally,
> > > but for 10K or 50K clauses it apparently does something that takes
> > > a lot of time. It might be because of the inner data structures
> > > that are used, or maybe just the iteration over so many terms takes
> > > time.
> > >
> > > I'll continue trying to pinpoint the exact bottleneck, or maybe
> > > using the new filters in 2.9.1 might help.
> > >
> > > On Sun, Nov 22, 2009 at 8:36 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> > >
> > > > Hmmm, could you show us what you do in your collector? One of the
> > > > gotchas with a collector is loading the documents in the inner
> > > > loop. Quick test: comment out whatever you're doing in the
> > > > underlying collector loop, and see if there's *any* noticeable
> > > > difference in speed. That'll tell you whether your problems arise
> > > > from the filter construction/search or from what you're doing in
> > > > the collector....
> > > >
> > > > Best
> > > > Erick
> > > >
> > > > On Sun, Nov 22, 2009 at 11:41 AM, Eran Sevi <erans...@gmail.com> wrote:
> > > >
> > > > > I think it shouldn't take 5X longer, since the number of
> > > > > results is only about 2X larger (and much smaller than the
> > > > > number of terms in the filter), but maybe I'm wrong here since
> > > > > I'm not familiar with the filter internals.
> > > > >
> > > > > Unfortunately, the time to construct the filter is mere
> > > > > milliseconds; almost all of the time (~5 secs) is spent in the
> > > > > search method. I'm using a collector to retrieve all the
> > > > > results (and fetch a value for some fields), but without the
> > > > > filter this also takes less than a second for the same number
> > > > > of results.
> > > > >
> > > > > On Sun, Nov 22, 2009 at 5:57 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> > > > >
> > > > > > Hmmm, I'm not very clear here. Are you saying that you
> > > > > > effectively form 10-50K filters and OR them all together?
> > > > > > That would be consistent with the 50K case taking
> > > > > > approximately 5X as long as the 10K case.....
> > > > > >
> > > > > > Do you know where in your code the time is being spent?
> > > > > > That'd be a big help in suggesting alternatives. If I'm on
> > > > > > the right track, I'd expect the time to be spent assembling
> > > > > > the filters.....
> > > > > >
> > > > > > Not much help here, but I'm having trouble wrapping my head
> > > > > > around this...
> > > > > >
> > > > > > Best
> > > > > > Erick
> > > > > >
> > > > > > On Sun, Nov 22, 2009 at 9:48 AM, Eran Sevi <erans...@gmail.com> wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > I have a need to filter my queries using a rather large
> > > > > > > subset of terms (can be 10K or even 50K). All these terms
> > > > > > > are sure to exist in the index, so the number of results
> > > > > > > can be about the same as the number of terms in the filter.
> > > > > > >
> > > > > > > The terms are numbers, but they are not consecutive and
> > > > > > > come from a large set of possible values (so range queries
> > > > > > > are probably not good for me).
> > > > > > >
> > > > > > > The index itself is about 1M docs, and running even a
> > > > > > > simple query with such a large filter takes a lot of time,
> > > > > > > even if the number of results is only a few hundred docs.
> > > > > > > It seems like the speed is affected by the length of the
> > > > > > > filter even when the number of results stays more or less
> > > > > > > the same, which is logical, but not by such a large loss of
> > > > > > > performance as I'm experiencing (running the query with a
> > > > > > > 10K-term filter takes an average of 1s 187ms with 600
> > > > > > > results, while running it with a 50K-term filter takes an
> > > > > > > average of 5s 207ms with 1000 results).
> > > > > > >
> > > > > > > Currently I'm using a QueryFilter with a BooleanQuery in
> > > > > > > which I "OR" the different terms together. I also can't use
> > > > > > > a cached filter efficiently, since the terms to filter on
> > > > > > > change almost every query.
> > > > > > >
> > > > > > > I was wondering if there's a better way to filter my
> > > > > > > queries so they won't take a few seconds to run?
> > > > > > >
> > > > > > > Thanks in advance for any advice,
> > > > > > > Eran.
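P.S. For completeness, here's the "quick test" I suggested earlier in the
thread, as code. Again just a sketch against the 2.4 HitCollector API,
and the class name is a placeholder:

import org.apache.lucene.search.HitCollector;

// Counts hits but never loads a Document, so if the search is still
// slow with this collector, the time isn't going into collecting.
public class CountingCollector extends HitCollector {
    public int count = 0;

    public void collect(int doc, float score) {
        count++; // deliberately no Document fetching in the inner loop
    }
}

Run the same filtered query once with searcher.search(query, filter,
new CountingCollector()) and once with your real collector; if the times
are close, the cost is in the filter/search itself, not in what you do
per hit.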