This was at least one of the threads that was bouncing around... I'm
fairly sure there were others as well.
Hopefully it's worth the read to you ^^
http://www.opensubscriber.com/message/java-...@lucene.apache.org/11079539.html
Phil Whelan wrote:
On Wed, Jul 22, 2009 at 12:28 PM, Matthew Hall wrote:
> Not sure if this helps you, but some of the issues you are facing seem
> similar to those in the "real time" search threads.
Hi Matthew,
Do you have a pointer of where to go to see the "real time" threads?
Thanks,
Phil
-
Not sure if this helps you, but some of the issues you are facing seem
similar to those in the "real time" search threads.
Basically their problem involves indexing Twitter and the blogosphere,
and making Lucene work for super large data sets like that.
Perhaps some of the discussion in those
> Out of curiosity, what is the size of your corpus? How much and how
> quickly do you expect it to grow?
In terms of Lucene documents, we tend to be in the 10M-100M range.
Currently we use merging to make larger indices from smaller ones, so
a single index can have a lot of documents in it, bu
Out of curiosity, what is the size of your corpus? How much and how
quickly do you expect it to grow?
I'm just trying to make sure that we are all on the same page here ^^
I can see the benefits of doing what you are describing with a very
large corpus that is expected to grow at a quick rate,
> If you did this, wouldn't you be binding the processing of the results
> of all queries to that of the slowest performing one within the collection?
I would imagine it would, but I haven't seen too much variance between
Lucene query speeds in our data.
> I'm guessing you are trying for some sor
Queries cannot be ordered "sequentially". Let's say that you run 3 Queries,
w/ one term each "a", "b" and "c". On disk, the posting lists of the terms
can look like this: post1(a), post1(c), post2(a), post1(b), post2(c),
post2(b) etc. They are not guaranteed to be consecutive. The code makes sure
t
> It's not accurate to say that Lucene scans the index for each search.
> Rather, every Query reads a set of posting lists, each of which is typically
> read from disk. If you pass a Query[] whose queries have nothing in common (for
> example no terms in common), then you won't gain anything, b/c each Query
>
If you did this, wouldn't you be binding the processing of the results
of all queries to that of the slowest performing one within the collection?
I'm guessing you are trying for some sort of performance benefit by
batch processing, but I question whether or not you will actually get
more perf
It's not accurate to say that Lucene scans the index for each search.
Rather, every Query reads a set of posting lists, each of which is typically
read from disk. If you pass a Query[] whose queries have nothing in common (for
example no terms in common), then you won't gain anything, b/c each Query
will alrea
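
To make the point about shared posting lists concrete, here is a rough toy sketch in Python. It is not Lucene code: the in-memory index, the run_queries helper and the AND semantics are all assumptions made up for illustration. It only counts how many posting lists get read when a batch of queries is run one by one versus with each distinct term read once.

```python
# Toy model only: a dict stands in for the on-disk index (hypothetical data).
TOY_INDEX = {
    "a": [1, 4, 7],
    "b": [2, 4],
    "c": [4, 7, 9],
}


def run_queries(index, queries, batched):
    """Run AND-queries (lists of terms) and count posting-list reads.

    When batched is True, a posting list already read for an earlier
    query in the batch is reused instead of being read again.
    """
    reads = 0
    cache = {}
    results = []
    for terms in queries:
        docs = None
        for term in terms:
            if batched and term in cache:
                postings = cache[term]
            else:
                postings = index.get(term, [])
                reads += 1  # one simulated posting-list read
                if batched:
                    cache[term] = postings
            # Intersect this term's doc ids with the running result.
            docs = set(postings) if docs is None else docs & set(postings)
        results.append(sorted(docs or []))
    return results, reads


# Queries with no terms in common: batching saves nothing.
disjoint = [["a"], ["b"], ["c"]]
# Queries that share the term "a": batching reads it only once.
overlapping = [["a", "b"], ["a", "c"], ["a"]]

for name, batch in (("disjoint", disjoint), ("overlapping", overlapping)):
    _, separate = run_queries(TOY_INDEX, batch, batched=False)
    _, shared = run_queries(TOY_INDEX, batch, batched=True)
    print(name, "separate reads:", separate, "batched reads:", shared)
```

In this toy model the disjoint batch costs the same number of reads either way, while the overlapping batch saves reads only for the repeated term. Real on-disk behaviour (caching, seek order, skip lists) is not modelled here.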