PhraseQuery and edit distance slightly confusing.

Dawid Weiss Wed, 15 Mar 2006 09:03:11 -0800


Hi there,

I get the concept implemented in PhraseQuery, but isn't calling it an edit distance a little far-fetched? Only the marginal elements (the minimum and maximum distances of terms from their respective query positions) are taken into account. Consider this example:


phrase:     a  b  c  d
term pos:   0  1  2  3

document A: a  c  b  d
term pos:   0  1  2  3
pos. diff:  0 -1  1  0

=> slop = (1 - (-1)) = 2

document B: a  c  b  x  d
term pos:   0  1  2  3  4
pos. diff:  0 -1  1  -  1

=> slop = (1 - (-1)) = 2

That's how it is currently implemented, isn't it? The scores differ (see the attached example) only because the "document" lengths differ; the phrases themselves are scored identically, even though I believe B should be penalized. A simple way to do that would be to include the phrase length divided by the matching span length... but I'm guessing it's implemented like that for a reason, I just don't know what that reason might be ;)
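To make the arithmetic above concrete, here is a minimal sketch of both measures: the max-minus-min position difference from the example, and the suggested phrase length / matching span length ratio. It assumes each phrase term occurs exactly once in the document, and the class and method names (PhraseMatchSketch, marginalSlop, spanRatio) are made up for illustration; this is not Lucene's code.

```java
import java.util.Arrays;
import java.util.List;

class PhraseMatchSketch {
    /**
     * max(diff) - min(diff), where diff = document position - query position
     * of each phrase term. Assumes every term occurs exactly once in doc.
     */
    static int marginalSlop(List<String> phrase, List<String> doc) {
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (int queryPos = 0; queryPos < phrase.size(); queryPos++) {
            int docPos = doc.indexOf(phrase.get(queryPos));
            if (docPos < 0) {
                throw new IllegalArgumentException("term missing: " + phrase.get(queryPos));
            }
            int diff = docPos - queryPos;
            min = Math.min(min, diff);
            max = Math.max(max, diff);
        }
        return max - min;
    }

    /**
     * Phrase length divided by the length of the span that covers all
     * phrase terms in the document (1.0 means no foreign terms in between).
     */
    static double spanRatio(List<String> phrase, List<String> doc) {
        int first = Integer.MAX_VALUE;
        int last = Integer.MIN_VALUE;
        for (String term : phrase) {
            int docPos = doc.indexOf(term);
            if (docPos < 0) {
                throw new IllegalArgumentException("term missing: " + term);
            }
            first = Math.min(first, docPos);
            last = Math.max(last, docPos);
        }
        int spanLength = last - first + 1;
        return (double) phrase.size() / spanLength;
    }

    public static void main(String[] args) {
        List<String> phrase = Arrays.asList("a", "b", "c", "d");
        List<String> docA = Arrays.asList("a", "c", "b", "d");
        List<String> docB = Arrays.asList("a", "c", "b", "x", "d");

        System.out.println(marginalSlop(phrase, docA)); // prints 2
        System.out.println(marginalSlop(phrase, docB)); // prints 2
        System.out.println(spanRatio(phrase, docA));    // prints 1.0
        System.out.println(spanRatio(phrase, docB));    // prints 0.8
    }
}
```

Both documents get the same value from the first measure, while the ratio separates them (1.0 for A, 0.8 for B), which is the penalty for B that I have in mind.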

Dawid

package com.dawidweiss.phd.spikes;

import java.io.IOException;

import junit.framework.TestCase;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.store.RAMDirectory;

public class PhraseQueryTest extends TestCase {
    public PhraseQueryTest(String s) {
        super(s);
    }
    
    public void testPhraseQuery() throws IOException {
        // Create an in-memory index.
        final RAMDirectory dir = new RAMDirectory();
        final Analyzer analyzer = new SimpleAnalyzer();
        final IndexWriter writer = new IndexWriter(dir, analyzer, true);
        
        final String [] documents = new String [] {
                "a c b x d",
                "a c b d",
        };

        for (final String document : documents) {
            final Document doc = new Document();
            doc.add(new Field("content", document, 
                    Field.Store.YES,
                    Field.Index.TOKENIZED,
                    Field.TermVector.WITH_POSITIONS_OFFSETS));
            writer.addDocument(doc);
        }
        writer.close();
        
        final IndexSearcher searcher = new IndexSearcher(dir);
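        // Phrase "a b c d" with slop 2, which is enough to match both
        // documents according to the position differences worked out above.
        final PhraseQuery pq = new PhraseQuery();
        pq.setSlop(2);
        pq.add(new Term("content", "a"));
        pq.add(new Term("content", "b"));
        pq.add(new Term("content", "c"));
        pq.add(new Term("content", "d"));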

        final BooleanQuery query = new BooleanQuery();
        query.add(pq, Occur.MUST);

        final Hits hits = searcher.search(query);
        for (int i = 0; i < hits.length(); i++) {
            System.out.println("Hit#" + i + ": " + hits.id(i));
            System.out.println(searcher.explain(query, hits.id(i)));
        }
    }
}
