Max Lynch wrote:
Well, here is what happens if I use a SpanScorer instead and set it up
like this:
analyzer = StandardAnalyzer([])
tokenStream = analyzer.tokenStream("contents",
                                   lucene.StringReader(text))
ctokenStream = lucene.CachingTokenFilter(tokenStream)
highlighter = lucene.Highlighter(formatter,
    lucene.HighlighterSpanScorer(self.query, "contents", ctokenStream))
ctokenStream.reset()
result = highlighter.getBestFragments(ctokenStream, text,
                                      2, "...")
My highlighter is still breaking up words inside of a span. For
example, if I search for "John Smith", instead of the highlighter being
called for the whole "John Smith", it gets called for "John" and then
"Smith".
I think you need to use SimpleSpanFragmenter (vs SimpleFragmenter,
which is the default used by Highlighter) to ensure that each fragment
contains a full match for the query. E.g., something like this (copied
from Lucene in Action, 2nd edition):
TermQuery query = new TermQuery(new Term("field", "fox"));
TokenStream tokenStream =
    new SimpleAnalyzer().tokenStream("field",
                                     new StringReader(text));
SpanScorer scorer = new SpanScorer(query, "field",
                                   new CachingTokenFilter(tokenStream));
Fragmenter fragmenter = new SimpleSpanFragmenter(scorer);
Highlighter highlighter = new Highlighter(scorer);
highlighter.setTextFragmenter(fragmenter);
Okay, I hacked something up in Java that illustrates my issue.
import org.apache.lucene.search.*;
import org.apache.lucene.analysis.*;
import org.apache.lucene.document.*;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.search.highlight.*;
import org.apache.lucene.search.spans.SpanTermQuery;
import java.io.Reader;
import java.io.StringReader;

public class PhraseTest {
    private IndexSearcher searcher;
    private RAMDirectory directory;

    public PhraseTest() throws Exception {
        directory = new RAMDirectory();

        Analyzer analyzer = new Analyzer() {
            public TokenStream tokenStream(String fieldName, Reader reader) {
                return new WhitespaceTokenizer(reader);
            }

            public int getPositionIncrementGap(String fieldName) {
                return 100;
            }
        };

        IndexWriter writer = new IndexWriter(directory, analyzer, true,
                IndexWriter.MaxFieldLength.LIMITED);

        Document doc = new Document();
        String text = "Jimbo John is his name";
        doc.add(new Field("contents", text, Field.Store.YES,
                Field.Index.ANALYZED));
        writer.addDocument(doc);

        writer.optimize();
        writer.close();

        searcher = new IndexSearcher(directory);

        // Try a phrase query
        PhraseQuery phraseQuery = new PhraseQuery();
        phraseQuery.add(new Term("contents", "Jimbo"));
        phraseQuery.add(new Term("contents", "John"));

        // Try a SpanTermQuery
        SpanTermQuery spanTermQuery = new SpanTermQuery(
                new Term("contents", "Jimbo John"));

        // Try a parsed query
        Query parsedQuery = new QueryParser("contents",
                analyzer).parse("\"Jimbo John\"");

        Hits hits = searcher.search(parsedQuery);
        System.out.println("We found " + hits.length() + " hits.");

        // Highlight the results
        CachingTokenFilter tokenStream = new CachingTokenFilter(
                analyzer.tokenStream("contents", new StringReader(text)));

        SimpleHTMLFormatter formatter = new SimpleHTMLFormatter();

        SpanScorer sc = new SpanScorer(parsedQuery, "contents", tokenStream,
                "contents");

        Highlighter highlighter = new Highlighter(formatter, sc);
        highlighter.setTextFragmenter(new SimpleSpanFragmenter(sc));
        tokenStream.reset();

        String rv = highlighter.getBestFragments(tokenStream, text, 1,
                "...");
        System.out.println(rv);
    }

    public static void main(String[] args) {
        System.out.println("Starting...");
        try {
            PhraseTest pt = new PhraseTest();
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}
Instead of highlighting <B>Jimbo John</B> as a unit, the output I get is
<B>Jimbo</B> followed by <B>John</B>. Can I get around this somehow? I
tried several different query types (they are declared in the code, but
only the parsed version is being used).
Thanks
-max
Sorry, not much you can do at the moment. The change is nontrivial for
sure (it's probably easier to write a regex that merges the adjacent
highlights). This limitation was accepted because with most markup the
result displays the same anyway. An option to merge would be great, but
while I don't remember the details, the last time I looked, it just
ain't easy to do given the implementation. The highlighter works by
running through and scoring tokens, not phrases, and the Span
highlighter asks whether a given token is in a given span to decide if
it should get a score over 0. Each token is handed off to the SpanScorer
to be scored, one at a time. I looked into adding the option at one
point (back when I was putting the SpanScorer together) and didn't find
it worth the effort after getting blocked a couple of times.
--
- Mark
http://www.lucidimagination.com
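
A minimal sketch of the regex workaround mentioned above, for anyone who
wants to try it. It post-processes the highlighter's output string and
does not touch Lucene at all. The class name HighlightMerger and the
method mergeAdjacent are hypothetical names made up for this
illustration, and the pattern assumes the <B> and </B> tags seen in the
output earlier in the thread; adjust it if your formatter uses other
tags.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: merges highlight tags that are
// separated only by whitespace in the highlighter's output string.
public class HighlightMerger {
    // A closing tag, a run of whitespace, then the next opening tag.
    // Assumes <B>...</B> tags, as in the output shown in this thread.
    private static final Pattern ADJACENT = Pattern.compile("</B>(\\s+)<B>");

    // Drops the "</B>" and "<B>" around the whitespace, keeping the whitespace.
    public static String mergeAdjacent(String highlighted) {
        return ADJACENT.matcher(highlighted).replaceAll("$1");
    }

    public static void main(String[] args) {
        String fragment = "<B>Jimbo</B> <B>John</B> is his name";
        // prints: <B>Jimbo John</B> is his name
        System.out.println(mergeAdjacent(fragment));
    }
}
```

Note that this merges any two highlighted tokens that sit next to each
other, whether or not they came from the same phrase, so it is only a
cosmetic fix on the output and not real phrase-level highlighting.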