Hi -

I have a basic question about the way queries are processed in Lucene. I understand that Lucene uses a variation of the vector space model to determine document similarity. In particular, I think it computes some sort of normalized TF-IDF score for a query against the collection of documents.

However, my question is this. In order to compute the TF-IDF score with respect to a particular document, it would seem that Lucene needs to iterate over all possible documents. For example, given a query q and a document d, compute score(q, d). In order to identify the highest score, it would seem that it needs to look at *all* documents (or else, how does it know how the query evaluates against each document?). This seems very inefficient, but I'm sure it's not the case -- I have heard that Lucene is generally pretty efficient.
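
To make the concern concrete, here is a rough Python sketch of the exhaustive approach I am imagining. The scoring function is a toy TF-IDF of my own, not Lucene's actual formula, and the names (tf_idf_score, exhaustive_search) are made up for illustration:

```python
import math
from collections import Counter

def tf_idf_score(query_terms, doc_terms, doc_freq, num_docs):
    """Toy normalized TF-IDF score(q, d); an assumption, not Lucene's formula."""
    if not doc_terms:
        return 0.0
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        if tf[term] == 0:
            continue  # term absent from this document contributes nothing
        idf = math.log(num_docs / doc_freq[term]) + 1.0
        score += math.sqrt(tf[term]) * idf
    # Length normalization so long documents are not favored automatically.
    return score / math.sqrt(len(doc_terms))

def exhaustive_search(query_terms, docs):
    """Score the query against every document, matching or not."""
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for terms in docs.values() for term in set(terms))
    scored = [
        (tf_idf_score(query_terms, terms, doc_freq, len(docs)), doc_id)
        for doc_id, terms in docs.items()
    ]
    return sorted(scored, reverse=True)
```

Here docs is a dict mapping a document id to its list of terms. The cost grows with the total number of documents, even though most of them score zero for any given query.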

If someone can please help me understand whether or not this is the case, I would appreciate it.

Just a note: it strikes me that an alternative is to first identify the set of documents that contain the query terms (i.e., a grep) and then iterate only over that set. In fact, this first step is often more complex in other systems where computing score() is more expensive.
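
A minimal sketch of that alternative, again in Python and with hypothetical names (build_inverted_index, candidate_docs). I am assuming a term-to-documents map built ahead of time; I do not know whether this is what Lucene actually does:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return index

def candidate_docs(query_terms, index):
    """Return only the documents that contain at least one query term."""
    candidates = set()
    for term in query_terms:
        # .get avoids creating empty entries for unknown terms.
        candidates |= index.get(term, set())
    return candidates
```

With this, score(q, d) would only need to be computed for the documents returned by candidate_docs, and every other document can be skipped without being examined.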

Thanks,
