Re: Batch searching

2009-07-23 Thread Matthew Hall
This was at least one of the threads that was bouncing around... I'm fairly sure there were others as well. Hopefully it's worth the read to you ^^ http://www.opensubscriber.com/message/java-...@lucene.apache.org/11079539.html Phil Whelan wrote: On Wed, Jul 22, 2009 at 12:28 PM, Matthew Hall w

Re: Batch searching

2009-07-22 Thread Phil Whelan
On Wed, Jul 22, 2009 at 12:28 PM, Matthew Hall wrote: > Not sure if this helps you, but some of the issues you are facing seem > similar to those in the "real time" search threads. Hi Matthew, Do you have a pointer to where I can find the "real time" threads? Thanks, Phil -

Re: Batch searching

2009-07-22 Thread Matthew Hall
Not sure if this helps you, but some of the issues you are facing seem similar to those in the "real time" search threads. Basically their problem involves indexing Twitter and the blogosphere, and making Lucene work for super large data sets like that. Perhaps some of the discussion in those

Re: Batch searching

2009-07-22 Thread tsuraan
> Out of curiosity, what is the size of your corpus? How much and how > quickly do you expect it to grow? In terms of Lucene documents, we tend to have somewhere in the 10M-100M range. Currently we use merging to make larger indices from smaller ones, so a single index can have a lot of documents in it, bu

Re: Batch searching

2009-07-22 Thread Matthew Hall
Out of curiosity, what is the size of your corpus? How much and how quickly do you expect it to grow? I'm just trying to make sure that we are all on the same page here ^^ I can see the benefits of doing what you are describing with a very large corpus that is expected to grow at a quick rate,

Re: Batch searching

2009-07-22 Thread tsuraan
> If you did this, wouldn't you be binding the processing of the results > of all queries to that of the slowest performing one within the collection? I would imagine it would, but I haven't seen too much variance between Lucene query speeds in our data. > I'm guessing you are trying for some sor

Re: Batch searching

2009-07-22 Thread Shai Erera
Queries cannot be ordered "sequentially". Let's say that you run 3 Queries, w/ one term each "a", "b" and "c". On disk, the posting lists of the terms can look like this: post1(a), post1(c), post2(a), post1(b), post2(c), post2(b) etc. They are not guaranteed to be consecutive. The code makes sure t

Re: Batch searching

2009-07-22 Thread tsuraan
> It's not accurate to say that Lucene scans the index for each search. > Rather, every Query reads a set of posting lists, each of which is typically read > from disk. If you pass Query[] that have nothing in common (for > example no terms in common), then you won't gain anything, b/c each Query >

Re: Batch searching

2009-07-22 Thread Matthew Hall
If you did this, wouldn't you be binding the processing of the results of all queries to that of the slowest performing one within the collection? I'm guessing you are trying for some sort of performance benefit by batch processing, but I question whether or not you will actually get more perf

Re: Batch searching

2009-07-22 Thread Shai Erera
It's not accurate to say that Lucene scans the index for each search. Rather, every Query reads a set of posting lists, each of which is typically read from disk. If you pass Query[] that have nothing in common (for example no terms in common), then you won't gain anything, b/c each Query will alrea
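
The point about shared terms can be illustrated with a toy model. The following is a minimal Python sketch, not Lucene code: ToyIndex, read_postings, search_one_by_one, and search_batched are hypothetical names invented for this illustration, and a counter stands in for disk reads. It assumes each query is a simple AND of terms, and shows that batching only saves posting-list reads when queries have terms in common.

```python
from collections import defaultdict


class ToyIndex:
    """Toy in-memory inverted index (hypothetical, not the Lucene API).

    Counts how many times a posting list is fetched, as a stand-in
    for reading that list from disk.
    """

    def __init__(self, docs):
        self.postings = defaultdict(list)
        for doc_id, text in enumerate(docs):
            for term in set(text.split()):
                self.postings[term].append(doc_id)
        self.reads = 0

    def read_postings(self, term):
        self.reads += 1  # one simulated disk read per call
        return self.postings.get(term, [])


def search_one_by_one(index, queries):
    """Run each AND-query separately, re-reading every posting list."""
    results = []
    for terms in queries:
        hits = None
        for term in terms:
            docs = set(index.read_postings(term))
            hits = docs if hits is None else hits & docs
        results.append(sorted(hits or []))
    return results


def search_batched(index, queries):
    """Read each distinct term's posting list once, then reuse it."""
    cache = {}
    for term in {t for q in queries for t in q}:
        cache[term] = set(index.read_postings(term))
    results = []
    for terms in queries:
        hits = None
        for term in terms:
            hits = cache[term] if hits is None else hits & cache[term]
        results.append(sorted(hits or []))
    return results
```

In this model, a batch such as [["a", "b"], ["a", "c"]] reads the list for "a" once instead of twice, while a batch such as [["a"], ["b"]] costs the same number of reads either way, which matches the observation that queries with no terms in common gain nothing from batching.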