> > Well, what happens is if I use a SpanScorer instead, and allocate it
> > like such:
> >
> >     analyzer = StandardAnalyzer([])
> >     tokenStream = analyzer.tokenStream("contents",
> >                                        lucene.StringReader(text))
> >     ctokenStream = lucene.CachingTokenFilter(tokenStream)
> >     highlighter = lucene.Highlighter(formatter,
> >         lucene.HighlighterSpanScorer(self.query, "contents",
> >                                      ctokenStream))
> >     ctokenStream.reset()
> >
> >     result = highlighter.getBestFragments(ctokenStream, text,
> >                                           2, "...")
> >
> > My highlighter is still breaking up words inside of a span. For
> > example, if I search for "John Smith", instead of the highlighter
> > being called for the whole "John Smith", it gets called for "John"
> > and then "Smith".
>
> I think you need to use SimpleSpanFragmenter (vs. SimpleFragmenter,
> which is the default used by Highlighter) to ensure that each fragment
> contains a full match for the query. E.g. something like this (copied
> from LIA, 2nd edition):
>
>     TermQuery query = new TermQuery(new Term("field", "fox"));
>
>     TokenStream tokenStream =
>         new SimpleAnalyzer().tokenStream("field",
>                                          new StringReader(text));
>
>     SpanScorer scorer = new SpanScorer(query, "field",
>                                        new CachingTokenFilter(tokenStream));
>     Fragmenter fragmenter = new SimpleSpanFragmenter(scorer);
>     Highlighter highlighter = new Highlighter(scorer);
>     highlighter.setTextFragmenter(fragmenter);

Okay, I hacked something up in Java that illustrates my issue:

import org.apache.lucene.search.*;
import org.apache.lucene.analysis.*;
import org.apache.lucene.document.*;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.search.highlight.*;
import org.apache.lucene.search.spans.SpanTermQuery;

import java.io.Reader;
import java.io.StringReader;

public class PhraseTest {
  private IndexSearcher searcher;
  private RAMDirectory directory;

  public PhraseTest() throws Exception {
    directory = new RAMDirectory();
    Analyzer analyzer = new Analyzer() {
      public TokenStream tokenStream(String fieldName, Reader reader) {
        return new WhitespaceTokenizer(reader);
      }

      public int getPositionIncrementGap(String fieldName) {
        return 100;
      }
    };

    IndexWriter writer = new IndexWriter(directory, analyzer, true,
        IndexWriter.MaxFieldLength.LIMITED);
    Document doc = new Document();
    String text = "Jimbo John is his name";
    doc.add(new Field("contents", text, Field.Store.YES,
        Field.Index.ANALYZED));
    writer.addDocument(doc);
    writer.optimize();
    writer.close();

    searcher = new IndexSearcher(directory);

    // Try a phrase query
    PhraseQuery phraseQuery = new PhraseQuery();
    phraseQuery.add(new Term("contents", "Jimbo"));
    phraseQuery.add(new Term("contents", "John"));

    // Try a SpanTermQuery
    SpanTermQuery spanTermQuery =
        new SpanTermQuery(new Term("contents", "Jimbo John"));

    // Try a parsed query
    Query parsedQuery =
        new QueryParser("contents", analyzer).parse("\"Jimbo John\"");

    Hits hits = searcher.search(parsedQuery);
    System.out.println("We found " + hits.length() + " hits.");

    // Highlight the results
    CachingTokenFilter tokenStream = new CachingTokenFilter(
        analyzer.tokenStream("contents", new StringReader(text)));
    SimpleHTMLFormatter formatter = new SimpleHTMLFormatter();
    SpanScorer sc = new SpanScorer(parsedQuery, "contents", tokenStream,
        "contents");
    Highlighter highlighter = new Highlighter(formatter, sc);
    highlighter.setTextFragmenter(new SimpleSpanFragmenter(sc));
    tokenStream.reset();

    String rv =
        highlighter.getBestFragments(tokenStream, text, 1, "...");
    System.out.println(rv);
  }

  public static void main(String[] args) {
    System.out.println("Starting...");
    try {
      PhraseTest pt = new PhraseTest();
    } catch (Exception ex) {
      ex.printStackTrace();
    }
  }
}

The output I'm getting highlights <B>Jimbo</B> and then <B>John</B> instead
of <B>Jimbo John</B>. Can I get around this somehow? I tried several
different query types (they are declared in the code, but only the parsed
version is being used).

Thanks,
-max
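P.S. One possible workaround, offered only as an untested sketch: since the
highlighter invokes the formatter once per matching term, the resulting
fragment could be post-processed to merge adjacent highlight tags that are
separated only by whitespace. This assumes the default SimpleHTMLFormatter
tags (<B> and </B>):

    // Collapse "</B> <B>" boundaries, so that
    // "<B>Jimbo</B> <B>John</B>" becomes "<B>Jimbo John</B>".
    // The capture group preserves the whitespace between the two terms.
    String merged = rv.replaceAll("</B>(\\s+)<B>", "$1");
    System.out.println(merged);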