Hi -
I have a basic question on the way queries are processed in Lucene. I
understand that Lucene uses a variation of the vector space model in terms
of how it detemines document similarity. In particular, I think it computes
some sort of normalized TF-IDF score for some query against the collection
of documents.
However, my question is this. In order for it to compute the TF-IDF score
with respect to a particular document, it would seem that Lucene would need
to iterate over all possible documents. For example, given a query q and a
document d, compute score(q, d). In order to identify the highest score, it
would seem that it would need to look at *all* documents (or else, how does
it know how a query evaluates against each a document?). This seems very
inefficient, but I'm sure it's not the case -- as I have heard that Lucene
is generally pretty efficient.
If someone can please help me understand whether or not this is the case, I
would appreciate it.
Just a note: strikes me that an alternative way to do things is to first
identify a set of documents that have the term in them first (i.e., a grep)
before doing the iteration. In fact, this first step is often more complex
in other systems where computing score() is more expensive.
Thanks,
_________________________________________________________________
Dont just search. Find. Check out the new MSN Search!
http://search.msn.click-url.com/go/onm00200636ave/direct/01/
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]