Hi there,
I get the concept implemented in PhraseQuery, but isn't calling it an
edit distance a bit far-fetched? Only the extreme elements (the minimum
and maximum offsets of the matched terms from their respective query
positions) are taken into account. Consider this example:

phrase:     a  b  c  d
term pos:   0  1  2  3

document A: a  c  b  d
term pos:   0  1  2  3
pos. diff:  0 -1  1  0
=> slop = (1 - (-1)) = 2

document B: a  c  b  x  d
term pos:   0  1  2  3  4
pos. diff:  0 -1  1  -  1
=> slop = (1 - (-1)) = 2

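To make the arithmetic above concrete, here is a minimal sketch of the max-minus-min calculation as I understand it. It is not Lucene code: the class SlopSketch and the helper matchDistance are made up for illustration, and it assumes every phrase term occurs exactly once in the document.

```java
import java.util.Arrays;
import java.util.List;

public class SlopSketch {

    /**
     * Hypothetical helper (not part of Lucene): for each phrase term, take
     * (position in document - position in phrase), then return max - min.
     * Assumes each phrase term occurs exactly once in the document.
     */
    static int matchDistance(List<String> phrase, List<String> doc) {
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (int queryPos = 0; queryPos < phrase.size(); queryPos++) {
            int docPos = doc.indexOf(phrase.get(queryPos));
            if (docPos < 0) {
                throw new IllegalArgumentException(
                    "term not in document: " + phrase.get(queryPos));
            }
            int diff = docPos - queryPos;
            min = Math.min(min, diff);
            max = Math.max(max, diff);
        }
        return max - min;
    }

    public static void main(String[] args) {
        List<String> phrase = Arrays.asList("a", "b", "c", "d");
        // Document A: diffs are 0, 1, -1, 0.
        System.out.println(matchDistance(phrase, Arrays.asList("a", "c", "b", "d")));
        // Document B: diffs are 0, 1, -1, 1 (the extra "x" does not change max - min).
        System.out.println(matchDistance(phrase, Arrays.asList("a", "c", "b", "x", "d")));
    }
}
```

Both calls give the same value, which is the point of the example: the interposed "x" in document B does not show up in the result.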
This max-minus-min calculation is how it is currently implemented,
isn't it? The scores in the attached example differ only because the
"document" lengths differ; the phrase matches themselves are scored
identically, even though I believe B should be penalized. A simple way
to do that would be to include the phrase length divided by the
matching span length... but I'm guessing it's implemented the way it is
for a reason, I just don't know what that reason might be ;)
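
A rough sketch of the factor I have in mind, under my own assumptions: "phrase length" is the number of query terms, and "matching span length" is the number of document positions from the first to the last matched term, inclusive. The class SpanPenaltySketch and the method spanFactor are hypothetical names, not anything in Lucene.

```java
public class SpanPenaltySketch {

    /**
     * Hypothetical factor (not part of Lucene): number of phrase terms
     * divided by the number of document positions the match covers.
     */
    static double spanFactor(int phraseLength, int firstMatchPos, int lastMatchPos) {
        int spanLength = lastMatchPos - firstMatchPos + 1;
        if (phraseLength <= 0 || spanLength <= 0) {
            throw new IllegalArgumentException("lengths must be positive");
        }
        return (double) phraseLength / spanLength;
    }

    public static void main(String[] args) {
        // Document A: "a c b d", matched terms cover positions 0..3.
        System.out.println("A: " + spanFactor(4, 0, 3));
        // Document B: "a c b x d", matched terms cover positions 0..4.
        System.out.println("B: " + spanFactor(4, 0, 4));
    }
}
```

With these definitions A keeps a factor of 1.0 while B drops to 0.8, so B would end up scored lower than A regardless of document length.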
Dawid

package com.dawidweiss.phd.spikes;

import java.io.IOException;

import junit.framework.TestCase;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.store.RAMDirectory;

public class PhraseQueryTest extends TestCase {
    public PhraseQueryTest(String s) {
        super(s);
    }

    public void testPhraseQuery() throws IOException {
        // Create an in-memory index.
        final RAMDirectory dir = new RAMDirectory();
        final Analyzer analyzer = new SimpleAnalyzer();
        final IndexWriter writer = new IndexWriter(dir, analyzer, true);

        final String [] documents = new String [] {
            "a c b x d",
            "a c b d",
        };

        for (final String document : documents) {
            final Document doc = new Document();
            doc.add(new Field("content", document,
                Field.Store.YES,
                Field.Index.TOKENIZED,
                Field.TermVector.WITH_POSITIONS_OFFSETS));
            writer.addDocument(doc);
        }
        writer.close();

        final IndexSearcher searcher = new IndexSearcher(dir);

        final PhraseQuery pq = new PhraseQuery();
        pq.setSlop(2);
        pq.add(new Term("content", "a"));
        pq.add(new Term("content", "b"));
        pq.add(new Term("content", "c"));
        pq.add(new Term("content", "d"));

        final BooleanQuery query = new BooleanQuery();
        query.add(pq, Occur.MUST);

        final Hits hits = searcher.search(query);
        for (int i = 0; i < hits.length(); i++) {
            System.out.println("Hit#" + i + ": " + hits.id(i));
            System.out.println(searcher.explain(query, hits.id(i)));
        }
    }
}