I have not looked at any highlighting code yet. Is there already an extension
of PhraseQuery that has getSpans() ?
Currently I am using this code originally by M. Harwood:
           // Convert the PhraseQuery into an equivalent SpanNearQuery
           // so that getSpans() can report match positions.
           PhraseQuery phraseQuery = (PhraseQuery) query;
           Term[] phraseQueryTerms = phraseQuery.getTerms();
           SpanQuery[] clauses = new SpanQuery[phraseQueryTerms.length];

           for (int i = 0; i < phraseQueryTerms.length; i++) {
               clauses[i] = new SpanTermQuery(phraseQueryTerms[i]);
           }

           // inOrder=false: PhraseQuery's slop allows terms to match
           // out of order, so the span version should too.
           SpanNearQuery sp = new SpanNearQuery(clauses,
                   phraseQuery.getSlop(), false);
           sp.setBoost(query.getBoost());

I don't think this exactly reproduces PhraseQuery's slop (edit-distance) semantics, but it approximates them very well in most cases.
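
For reference, a minimal sketch of how the resulting SpanNearQuery might then be walked for highlighting (assuming an open IndexReader named reader, and the classic next()/doc()/start()/end() Spans API):

    // Iterate the matching spans; start() and end() are token
    // positions, not character offsets, so offsets still have to
    // be looked up (or the text re-tokenized) for highlighting.
    Spans spans = sp.getSpans(reader);
    while (spans.next()) {
        int doc = spans.doc();        // matching document
        int startPos = spans.start(); // position of first matching token
        int endPos = spans.end();     // one past the last matching token
    }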

I wonder if this approach to highlighting would be worth it in the end. It would certainly seem to require that you store offsets; otherwise you would have to re-tokenize anyway.

Some more interesting "stuff" on the current Highlighter methods:

We can gain a lot of speed in the current Highlighter implementation if we grab the source text in bigger chunks. Ronnie's Highlighter appears to be faster than the original for two reasons: he doesn't have to re-tokenize the text, and he rebuilds the original document in large pieces. Depending on how you want to look at it, most of the speed he gains from looking only at the Query tokens (instead of all tokens) is then lost to pulling the Term offset information, which appears to be pretty slow.

If you use a SimpleAnalyzer on docs around 1800 tokens long, you can actually match the speed of Ronnie's highlighter with the current highlighter just by rebuilding the highlighted document in bigger pieces: instead of going through each token and adding the source text it covers, accumulate the offset information until you reach the next hit, then pull from the source text into the highlighted text in one big piece rather than a token's worth at a time. Of course, this is not compatible with the way the Fragmenter currently works. If you use the StandardAnalyzer instead of SimpleAnalyzer, Ronnie's highlighter wins, because re-analyzing takes so darn long.
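
A rough sketch of that chunked rebuild (hypothetical plumbing: sourceText is the stored field text, and hits is a list of int[]{startOffset, endOffset} pairs in document order):

    // Copy source text between hits in one big piece instead of
    // appending one token's worth of text at a time.
    StringBuffer highlighted = new StringBuffer(sourceText.length() + 64);
    int lastEnd = 0;
    for (Iterator it = hits.iterator(); it.hasNext();) {
        int[] hit = (int[]) it.next(); // {startOffset, endOffset}
        highlighted.append(sourceText.substring(lastEnd, hit[0])); // big chunk
        highlighted.append("<B>");
        highlighted.append(sourceText.substring(hit[0], hit[1]));
        highlighted.append("</B>");
        lastEnd = hit[1];
    }
    highlighted.append(sourceText.substring(lastEnd)); // trailing chunk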

It is also interesting to note that it is very difficult to see a gain from using TokenSources to build a TokenStream. With the StandardAnalyzer, it takes docs of around 1800 tokens just to be as fast as re-analyzing. Notice I didn't say faster, just "as fast". For anything smaller, or with a simpler analyzer, TokenSources is certainly not worth it: it just takes too long to pull the TermVector info.
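
For comparison, the TokenSources path being measured looks roughly like this (a sketch assuming the contrib highlighter's TokenSources of that era, plus an already-open reader, docId, analyzer, and highlighter):

    // Build a TokenStream from stored TermVector offsets when present;
    // getAnyTokenStream() falls back to re-analyzing the stored text
    // if the field has no TermVector with positions and offsets.
    TokenStream tokenStream =
            TokenSources.getAnyTokenStream(reader, docId, "contents", analyzer);
    String fragments = highlighter.getBestFragments(tokenStream,
            sourceText, 3, "...");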

- Mark


