On Wed, Jun 3, 2009 at 7:34 PM, Mark Miller <markrmil...@gmail.com> wrote:
> Max Lynch wrote:
>
>>>> Well what happens is if I use a SpanScorer instead, and allocate it like
>>>> such:
>>>>
>>>>     analyzer = StandardAnalyzer([])
>>>>     tokenStream = analyzer.tokenStream("contents",
>>>>         lucene.StringReader(text))
>>>>     ctokenStream = lucene.CachingTokenFilter(tokenStream)
>>>>     highlighter = lucene.Highlighter(formatter,
>>>>         lucene.HighlighterSpanScorer(self.query, "contents", ctokenStream))
>>>>     ctokenStream.reset()
>>>>
>>>>     result = highlighter.getBestFragments(ctokenStream, text,
>>>>         2, "...")
>>>>
>>>> My highlighter is still breaking up words inside of a span. For example,
>>>> if I search for \"John Smith\", instead of the highlighter being called for
>>>> the whole "John Smith", it gets called for "John" and then "Smith".
>>>
>>> I think you need to use SimpleSpanFragmenter (vs SimpleFragmenter,
>>> which is the default used by Highlighter) to ensure that each fragment
>>> contains a full match for the query. EG something like this (copied
>>> from LIA 2nd edition):
>>>
>>>     TermQuery query = new TermQuery(new Term("field", "fox"));
>>>
>>>     TokenStream tokenStream =
>>>         new SimpleAnalyzer().tokenStream("field",
>>>             new StringReader(text));
>>>
>>>     SpanScorer scorer = new SpanScorer(query, "field",
>>>         new CachingTokenFilter(tokenStream));
>>>     Fragmenter fragmenter = new SimpleSpanFragmenter(scorer);
>>>     Highlighter highlighter = new Highlighter(scorer);
>>>     highlighter.setTextFragmenter(fragmenter);
>>
>> Okay, I hacked something up in Java that illustrates my issue.
>>
>> import org.apache.lucene.search.*;
>> import org.apache.lucene.analysis.*;
>> import org.apache.lucene.document.*;
>> import org.apache.lucene.index.IndexWriter;
>> import org.apache.lucene.analysis.standard.StandardAnalyzer;
>> import org.apache.lucene.index.Term;
>> import org.apache.lucene.queryParser.QueryParser;
>> import org.apache.lucene.store.Directory;
>> import org.apache.lucene.store.RAMDirectory;
>> import org.apache.lucene.search.highlight.*;
>> import org.apache.lucene.search.spans.SpanTermQuery;
>> import java.io.Reader;
>> import java.io.StringReader;
>>
>> public class PhraseTest {
>>     private IndexSearcher searcher;
>>     private RAMDirectory directory;
>>
>>     public PhraseTest() throws Exception {
>>         directory = new RAMDirectory();
>>
>>         Analyzer analyzer = new Analyzer() {
>>             public TokenStream tokenStream(String fieldName, Reader reader) {
>>                 return new WhitespaceTokenizer(reader);
>>             }
>>
>>             public int getPositionIncrementGap(String fieldName) {
>>                 return 100;
>>             }
>>         };
>>
>>         IndexWriter writer = new IndexWriter(directory, analyzer, true,
>>             IndexWriter.MaxFieldLength.LIMITED);
>>
>>         Document doc = new Document();
>>         String text = "Jimbo John is his name";
>>         doc.add(new Field("contents", text, Field.Store.YES,
>>             Field.Index.ANALYZED));
>>         writer.addDocument(doc);
>>
>>         writer.optimize();
>>         writer.close();
>>
>>         searcher = new IndexSearcher(directory);
>>
>>         // Try a phrase query
>>         PhraseQuery phraseQuery = new PhraseQuery();
>>         phraseQuery.add(new Term("contents", "Jimbo"));
>>         phraseQuery.add(new Term("contents", "John"));
>>
>>         // Try a SpanTermQuery
>>         SpanTermQuery spanTermQuery = new SpanTermQuery(
>>             new Term("contents", "Jimbo John"));
>>
>>         // Try a parsed query
>>         Query parsedQuery = new QueryParser("contents",
>>             analyzer).parse("\"Jimbo John\"");
>>
>>         Hits hits = searcher.search(parsedQuery);
>>         System.out.println("We found " + hits.length() + " hits.");
>>
>>         // Highlight the results
>>         CachingTokenFilter tokenStream = new CachingTokenFilter(
>>             analyzer.tokenStream("contents", new StringReader(text)));
>>
>>         SimpleHTMLFormatter formatter = new SimpleHTMLFormatter();
>>
>>         SpanScorer sc = new SpanScorer(parsedQuery, "contents",
>>             tokenStream, "contents");
>>
>>         Highlighter highlighter = new Highlighter(formatter, sc);
>>         highlighter.setTextFragmenter(new SimpleSpanFragmenter(sc));
>>         tokenStream.reset();
>>
>>         String rv = highlighter.getBestFragments(tokenStream, text, 1,
>>             "...");
>>         System.out.println(rv);
>>     }
>>
>>     public static void main(String[] args) {
>>         System.out.println("Starting...");
>>         try {
>>             PhraseTest pt = new PhraseTest();
>>         } catch(Exception ex) {
>>             ex.printStackTrace();
>>         }
>>     }
>> }
>>
>> The output I'm getting is instead of highlighting <B>Jimbo John</B> it does
>> <B>Jimbo</B> then <B>John</B>. Can I get around this somehow? I tried
>> several different query types (they are declared in the code, but only the
>> parsed version is being used).
>>
>> Thanks
>> -max
>
> Sorry, not much you can do at the moment. The change is non trivial for
> sure (it's probably easier to write some regex that merges them). This
> limitation was accepted because with most markup, it will display the same
> anyway. An option to merge would be great, and while I don't remember the
> details, the last time I looked, it just ain't easy to do based on the
> implementation. The highlighter highlights by running through and scoring
> tokens, not phrases, and the Span highlighter asks if a given token is in a
> given span to see if it should get a score over 0. Token by token handed off
> to the SpanScorer to be scored. I looked into adding the option at one point
> (back when I was putting the SpanScorer together) and didn't find it worth
> the effort after getting blocked a couple times.

Thanks anyways Mark.
Yeah, what I gathered from the results is that I will only get hits and highlights for phrases if the whole phrase was found, but each term will be highlighted separately. I just combine them now, but was hoping for a more elegant solution. At least I know that what I'm highlighting isn't random parts of the text but the actual phrase, so all is not lost.

-max
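
For anyone finding this thread later, the regex merge Mark mentions could look roughly like the sketch below. It assumes the default SimpleHTMLFormatter tags (<B> and </B>, as in the output above) and that adjacent highlighted tokens are separated only by whitespace. The class and method names (MergeHighlights, mergeAdjacent) are made up for illustration and are not part of Lucene.

```java
import java.util.regex.Pattern;

public class MergeHighlights {
    // Matches a closing </B>, the whitespace between two highlighted
    // tokens, and the next opening <B>. Assumes the default
    // SimpleHTMLFormatter tags; adjust if a custom formatter is used.
    private static final Pattern ADJACENT = Pattern.compile("</B>(\\s+)<B>");

    // Hypothetical helper: collapses neighbouring highlights into one,
    // keeping the whitespace that was between them.
    public static String mergeAdjacent(String highlighted) {
        return ADJACENT.matcher(highlighted).replaceAll("$1");
    }

    public static void main(String[] args) {
        String rv = "<B>Jimbo</B> <B>John</B> is his name";
        // prints "<B>Jimbo John</B> is his name"
        System.out.println(mergeAdjacent(rv));
    }
}
```

Note that this merges any two highlighted tokens that happen to be next to each other, whether or not they came from the same phrase, so it is only a workaround on the output string and not a change to how the highlighter scores tokens.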