Re: Phrase Highlighting

Mark Miller Wed, 03 Jun 2009 17:35:08 -0700

Max Lynch wrote:

Well what happens is if I use a SpanScorer instead, and allocate it like

such:

           analyzer = StandardAnalyzer([])
           tokenStream = analyzer.tokenStream("contents",
lucene.StringReader(text))
           ctokenStream = lucene.CachingTokenFilter(tokenStream)
           highlighter = lucene.Highlighter(formatter,
lucene.HighlighterSpanScorer(self.query, "contents", ctokenStream))
           ctokenStream.reset()

           result = highlighter.getBestFragments(ctokenStream, text,
                   2, "...")

 My highlighter is still breaking up words inside of a span.  For

example,

if I search for \"John Smith\", instead of the highlighter being called

for

the whole "John Smith", it gets called for "John" and then "Smith".

I think you need to use SimpleSpanFragmenter (vs SimpleFragmenter,
which is the default used by Highlighter) to ensure that each fragment
contains a full match for the query.  EG something like this (copied
from LIA 2nd edition):

   TermQuery query = new TermQuery(new Term("field", "fox"));

   TokenStream tokenStream =
       new SimpleAnalyzer().tokenStream("field",
           new StringReader(text));

   SpanScorer scorer = new SpanScorer(query, "field",
                                      new CachingTokenFilter(tokenStream));
   Fragmenter fragmenter = new SimpleSpanFragmenter(scorer);
   Highlighter highlighter = new Highlighter(scorer);
   highlighter.setTextFragmenter(fragmenter);




Okay, I hacked something up in Java that illustrates my issue.


import org.apache.lucene.search.*;
import org.apache.lucene.analysis.*;
import org.apache.lucene.document.*;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.search.highlight.*;
import org.apache.lucene.search.spans.SpanTermQuery;
import java.io.Reader;
import java.io.StringReader;

public class PhraseTest {
    private IndexSearcher searcher;
    private RAMDirectory directory;

    public PhraseTest() throws Exception {
        directory = new RAMDirectory();

        Analyzer analyzer = new Analyzer() {
            public TokenStream tokenStream(String fieldName, Reader reader)
{
                return new WhitespaceTokenizer(reader);
            }

            public int getPositionIncrementGap(String fieldName) {
                return 100;
            }
        };

        IndexWriter writer = new IndexWriter(directory, analyzer, true,
                IndexWriter.MaxFieldLength.LIMITED);

        Document doc = new Document();
        String text = "Jimbo John is his name";
        doc.add(new Field("contents", text, Field.Store.YES,
Field.Index.ANALYZED));
        writer.addDocument(doc);

        writer.optimize();
        writer.close();

        searcher = new IndexSearcher(directory);

        // Try a phrase query
        PhraseQuery phraseQuery = new PhraseQuery();
        phraseQuery.add(new Term("contents", "Jimbo"));
        phraseQuery.add(new Term("contents", "John"));

        // Try a SpanTermQuery
        SpanTermQuery spanTermQuery = new SpanTermQuery(new Term("contents",
"Jimbo John"));

        // Try a parsed query
        Query parsedQuery = new QueryParser("contents",
analyzer).parse("\"Jimbo John\"");

        Hits hits = searcher.search(parsedQuery);
        System.out.println("We found " + hits.length() + " hits.");

        // Highlight the results
        CachingTokenFilter tokenStream = new
CachingTokenFilter(analyzer.tokenStream( "contents", new
StringReader(text)));

        SimpleHTMLFormatter formatter = new SimpleHTMLFormatter();

        SpanScorer sc = new SpanScorer(parsedQuery, "contents", tokenStream,
"contents");

        Highlighter highlighter = new Highlighter(formatter, sc);
        highlighter.setTextFragmenter(new SimpleSpanFragmenter(sc));
        tokenStream.reset();

        String rv = highlighter.getBestFragments(tokenStream, text, 1,
"...");
        System.out.println(rv);

    }
    public static void main(String[] args) {
        System.out.println("Starting...");
        try {
            PhraseTest pt = new PhraseTest();
        } catch(Exception ex) {
            ex.printStackTrace();
        }
    }
}



The output I'm getting is instead of highlighting <B>Jimbo John</B> it does
<B>Jimbo</B> then <B>John</B>.  Can I get around this some how?  I tried
several different query types (they are declared in the code, but only the
parsed version is being used).

Thanks
-max

Sorry, not much you can do at the moment. The change is non trivial forsure (its probably easier to write some regex that merges them). Thislimitation was accepted because with most markup, it will display thesame anyway. An option to merge would be great, and while I don'tremember the details, the last time I looked, it just ain't easy to dobased on the implementation. The highlighter highlights by runningthrough and scoring tokens, not phrases, and the Span highlighter asksif a given token is in a given span to see if it should get a score over0. Token by token handed off to the SpanScorer to be scored. I lookedinto adding the option at one point (back when I was putting theSpanScorer together) and didn't find it worth the effort after gettingblocked a couple times.



--
- Mark

http://www.lucidimagination.com




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Re: Phrase Highlighting

Reply via email to