Max Lynch wrote:
Well, what happens is if I use a SpanScorer instead, and allocate it like such:

           analyzer = StandardAnalyzer([])
           tokenStream = analyzer.tokenStream("contents",
                   lucene.StringReader(text))
           ctokenStream = lucene.CachingTokenFilter(tokenStream)
           highlighter = lucene.Highlighter(formatter,
                   lucene.HighlighterSpanScorer(self.query, "contents",
                           ctokenStream))
           ctokenStream.reset()

           result = highlighter.getBestFragments(ctokenStream, text,
                   2, "...")

My highlighter is still breaking up words inside of a span.  For example,
if I search for "John Smith", instead of the highlighter being called for
the whole "John Smith", it gets called for "John" and then "Smith".

I think you need to use SimpleSpanFragmenter (vs SimpleFragmenter,
which is the default used by Highlighter) to ensure that each fragment
contains a full match for the query.  E.g., something like this (copied
from LIA 2nd edition):

   TermQuery query = new TermQuery(new Term("field", "fox"));

   TokenStream tokenStream =
       new SimpleAnalyzer().tokenStream("field",
           new StringReader(text));

   SpanScorer scorer = new SpanScorer(query, "field",
                                      new CachingTokenFilter(tokenStream));
   Fragmenter fragmenter = new SimpleSpanFragmenter(scorer);
   Highlighter highlighter = new Highlighter(scorer);
   highlighter.setTextFragmenter(fragmenter);



Okay, I hacked something up in Java that illustrates my issue.


import org.apache.lucene.search.*;
import org.apache.lucene.analysis.*;
import org.apache.lucene.document.*;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.search.highlight.*;
import org.apache.lucene.search.spans.SpanTermQuery;
import java.io.Reader;
import java.io.StringReader;

public class PhraseTest {
    private IndexSearcher searcher;
    private RAMDirectory directory;

    public PhraseTest() throws Exception {
        directory = new RAMDirectory();

        Analyzer analyzer = new Analyzer() {
            public TokenStream tokenStream(String fieldName, Reader reader) {
                return new WhitespaceTokenizer(reader);
            }

            public int getPositionIncrementGap(String fieldName) {
                return 100;
            }
        };

        IndexWriter writer = new IndexWriter(directory, analyzer, true,
                IndexWriter.MaxFieldLength.LIMITED);

        Document doc = new Document();
        String text = "Jimbo John is his name";
        doc.add(new Field("contents", text, Field.Store.YES,
                Field.Index.ANALYZED));
        writer.addDocument(doc);

        writer.optimize();
        writer.close();

        searcher = new IndexSearcher(directory);

        // Try a phrase query
        PhraseQuery phraseQuery = new PhraseQuery();
        phraseQuery.add(new Term("contents", "Jimbo"));
        phraseQuery.add(new Term("contents", "John"));

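        // Try a SpanTermQuery (unused below). Note that the single term
        // "Jimbo John" cannot match here: the whitespace tokenizer indexes
        // "Jimbo" and "John" as separate terms.
        SpanTermQuery spanTermQuery = new SpanTermQuery(
                new Term("contents", "Jimbo John"));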

        // Try a parsed query
        Query parsedQuery = new QueryParser("contents", analyzer)
                .parse("\"Jimbo John\"");

        Hits hits = searcher.search(parsedQuery);
        System.out.println("We found " + hits.length() + " hits.");

        // Highlight the results
        CachingTokenFilter tokenStream = new CachingTokenFilter(
                analyzer.tokenStream("contents", new StringReader(text)));

        SimpleHTMLFormatter formatter = new SimpleHTMLFormatter();

        SpanScorer sc = new SpanScorer(parsedQuery, "contents", tokenStream,
                "contents");

        Highlighter highlighter = new Highlighter(formatter, sc);
        highlighter.setTextFragmenter(new SimpleSpanFragmenter(sc));
        tokenStream.reset();

        String rv = highlighter.getBestFragments(tokenStream, text, 1,
                "...");
        System.out.println(rv);

    }
    public static void main(String[] args) {
        System.out.println("Starting...");
        try {
            PhraseTest pt = new PhraseTest();
        } catch(Exception ex) {
            ex.printStackTrace();
        }
    }
}



Instead of highlighting <B>Jimbo John</B>, the output I'm getting is
<B>Jimbo</B> then <B>John</B>.  Can I get around this somehow?  I tried
several different query types (they are declared in the code, but only the
parsed version is being used).

Thanks
-max

Sorry, not much you can do at the moment. The change is non-trivial for sure (it's probably easier to write some regex that merges them). This limitation was accepted because with most markup, it will display the same anyway. An option to merge would be great, and while I don't remember the details, the last time I looked, it just ain't easy to do based on the implementation.

The highlighter highlights by running through and scoring tokens, not phrases, and the Span highlighter asks if a given token is in a given span to see if it should get a score over 0. Each token is handed off to the SpanScorer to be scored, one at a time. I looked into adding the option at one point (back when I was putting the SpanScorer together) and didn't find it worth the effort after getting blocked a couple of times.


--
- Mark

http://www.lucidimagination.com
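
A minimal sketch of the regex merge mentioned above, assuming the default
SimpleHTMLFormatter tags (<B> and </B>) and only whitespace between adjacent
hits. The class and method names (HighlightMerger, mergeAdjacent) are made up
for illustration and are not part of Lucene. It post-processes the string
returned by getBestFragments rather than changing how tokens are scored.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: merges highlight tags that are
// separated only by whitespace, so "<B>a</B> <B>b</B>" becomes "<B>a b</B>".
public class HighlightMerger {
    // A closing </B>, optional whitespace (captured), then an opening <B>.
    private static final Pattern ADJACENT_TAGS =
            Pattern.compile("</B>(\\s*)<B>");

    public static String mergeAdjacent(String highlighted) {
        // Drop the inner tag pair and keep the whitespace between the tokens.
        return ADJACENT_TAGS.matcher(highlighted).replaceAll("$1");
    }

    public static void main(String[] args) {
        String fragment = "<B>Jimbo</B> <B>John</B> is his name";
        // prints: <B>Jimbo John</B> is his name
        System.out.println(mergeAdjacent(fragment));
    }
}
```

One caveat: this merges any highlighted tokens that happen to sit next to
each other, whether or not they came from the same phrase match, so it is
only a cosmetic fix for the rendered markup.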



