Re: Location of code which determines a Hit for PhraseQuery

Paul Elschot Thu, 08 Sep 2005 00:23:46 -0700

Hi Sean,

On Thursday 08 September 2005 05:16, Sean O'Connor wrote:
> Hi,
>     I am trying to work through the Hit collection process for a 
> PhraseQuery (using an exact phrase). For an example search, say I'm 
> looking for:
> "lucene action"  (quotes indicating exact phrase)
> 
> in a one doc, one field index consisting of:
> wow, lucene rocks, lucene action items are cool, very action packed
> 
> 
> So my questions are:
> - What does (Default)Similarity.idf() do?
>     Just create a base frequency for a term in the query?
>     i.e. it creates something of a raw score, based on term (not phrase) 
> frequency in a doc


It defines a basic weight for a term in a query. It uses the number of
documents containing the term and the total number of documents
in the index, not the frequency of the term within a document.
The value is therefore the same for every document matched by the query.
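
As a rough sketch of the above, assuming the formula used by
DefaultSimilarity in 1.4 is ln(numDocs / (docFreq + 1)) + 1 and that the
idf of a phrase is the sum of the idfs of its terms (check the source of
your version). The class and method names here are made up for the example:

```java
final class IdfSketch {
    // Assumed formula: ln(numDocs / (docFreq + 1)) + 1.
    // docFreq = number of documents containing the term,
    // numDocs = total number of documents in the index.
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    // Assumption: a phrase gets the sum of the idf values of its terms.
    static float phraseIdf(int[] docFreqs, int numDocs) {
        float sum = 0.0f;
        for (int docFreq : docFreqs) {
            sum += idf(docFreq, numDocs);
        }
        return sum;
    }

    public static void main(String[] args) {
        // The one-document index from the question: "lucene" and "action"
        // each occur in 1 of 1 documents.
        System.out.println("idf per term:   " + idf(1, 1));
        System.out.println("idf for phrase: " + phraseIdf(new int[] {1, 1}, 1));
        // In a larger index a rarer term gets a larger weight.
        System.out.println(idf(10, 1000) > idf(500, 1000));
    }
}
```

Nothing in here looks at a particular document, which is why the value is
fixed for the whole search.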
 
> - Does (Phrase)Weight do the work of finding a phrase match?
>     I don't think so, but get confused in the weight/scorer functionality

A Weight determines the weights to be used for scoring a query
in a particular search. It collects the weights from the
parts of the query and then normalizes them. For nested
queries this can involve some recursion.
The javadoc of Weight in the svn trunk is more detailed than in 1.4.3.
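
A simplified sketch of that collect-then-normalize sequence, modelled
loosely on the Weight methods sumOfSquaredWeights() and normalize() and on
a query norm of 1 / sqrt(sum of squared weights). The class below is a
stand-in written for this mail, not Lucene code, and the idf values in
main() are arbitrary:

```java
final class WeightSketch {
    final float idf;
    final float boost;
    float queryWeight; // weight of this clause within the query
    float value;       // factor later used by the scorer

    WeightSketch(float idf, float boost) {
        this.idf = idf;
        this.boost = boost;
    }

    // Step 1: each part of the query reports its squared raw weight.
    float sumOfSquaredWeights() {
        queryWeight = idf * boost;
        return queryWeight * queryWeight;
    }

    // Step 2: the norm computed from all parts is pushed back down.
    void normalize(float queryNorm) {
        queryWeight *= queryNorm;
        value = queryWeight * idf;
    }

    // Assumed normalization: 1 / sqrt(sum of squared weights).
    static float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }

    public static void main(String[] args) {
        WeightSketch[] parts = {
            new WeightSketch(2.0f, 1.0f), new WeightSketch(1.5f, 1.0f)
        };
        float sum = 0.0f;
        for (WeightSketch part : parts) {
            sum += part.sumOfSquaredWeights();
        }
        float norm = queryNorm(sum);
        for (WeightSketch part : parts) {
            part.normalize(norm);
            System.out.println(part.queryWeight + " " + part.value);
        }
    }
}
```

For a nested query, the enclosing Weight would call these two methods on
its sub-weights, which is where the recursion comes from.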
 
> -Does ExactPhraseScorer do the work of finding a phrase match?

Yes, but it does a bit more: it also counts the number of phrase matches
in the document. The score of a phrase in a document only depends
on this count.

>     I think so, but seem to keep missing where
>     i.e. where does the first instance of "lucene" (lucene rocks) get 
> skipped but the second one (lucene action) become a hit?

In ExactPhraseScorer.phraseFreq() the freq variable
is not incremented for the first instance, but it is incremented
for the second one.
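
To illustrate with your example document, here is a small stand-alone
sketch of exact phrase counting. It is not the ExactPhraseScorer code
(which steps through PhrasePositions objects); it only shows the idea
that a match needs term i of the phrase at position start + i. The
positions assume simple tokenization without stop word removal, so
"lucene" is at 1 and 3 and "action" at 4 and 9:

```java
import java.util.Arrays;

final class ExactPhraseSketch {
    // positions[i] holds the sorted positions of the i-th phrase term
    // in a single document. Returns the number of exact phrase matches.
    static int phraseFreq(int[][] positions) {
        if (positions.length == 0) {
            return 0;
        }
        int freq = 0;
        for (int start : positions[0]) {
            boolean match = true;
            // Term i must occur exactly i positions after the first term.
            for (int i = 1; i < positions.length && match; i++) {
                match = Arrays.binarySearch(positions[i], start + i) >= 0;
            }
            if (match) {
                freq++;
            }
        }
        return freq;
    }

    public static void main(String[] args) {
        int[] lucene = {1, 3}; // "lucene rocks", "lucene action"
        int[] action = {4, 9}; // "lucene action", "action packed"
        // start = 1 needs "action" at 2: no match, freq stays 0.
        // start = 3 needs "action" at 4: match, freq becomes 1.
        System.out.println(phraseFreq(new int[][] {lucene, action})); // 1
    }
}
```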

Non-exact (sloppy) phrases can have more hits in a document, so they
need a different implementation of phraseFreq().

> - What does SegmentTermPositions do?
>     I think this is critical to the PhrasePositions process, which 
> PhraseScorer uses
>     I think it holds pointers to the positions in the index file streams 
> (ability to read the indexes?)

Correct. This is one of the things that make Lucene fast. It is
package private because it is not normally used directly
in a Lucene application.
Optimizing an index merges all the segments into one.
 
Regards,
Paul Elschot

