Re: Highlighter that works with phrase and span queries

Mark Miller Thu, 21 Jun 2007 14:20:21 -0700

Results of my tests :

The new SpanScorer is just about the same speed as the old Highlighter's QueryScorer if the Query contains no position sensitive elements. This is only the case, however, if the CachingTokenFilter (from the analysis package) is changed to use an ArrayList instead of a LinkedList (LinkedList is the wrong data structure for this use).
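
To illustrate the ArrayList point, here is a minimal sketch of a replayable token cache. It is not Lucene's CachingTokenFilter: the class name ReplayableTokenCache, the use of plain Strings as tokens, and the next()/reset() methods are all assumptions made for the example. It only shows the idea of filling an array-backed list once and replaying it with an integer cursor, which avoids the per-node allocation and pointer chasing of a linked list.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-in for a caching token filter; not Lucene's CachingTokenFilter. */
final class ReplayableTokenCache {
    private final List<String> cache = new ArrayList<>();
    private final Iterator<String> source;
    private boolean filled = false;
    private int cursor = 0;

    ReplayableTokenCache(Iterator<String> source) {
        this.source = source;
    }

    /** Returns the next token, or null when the cached stream is exhausted. */
    String next() {
        if (!filled) {
            // First call: drain the underlying source once into the array-backed list.
            while (source.hasNext()) {
                cache.add(source.next());
            }
            filled = true;
        }
        return cursor < cache.size() ? cache.get(cursor++) : null;
    }

    /** Rewinds the cursor so the cached tokens can be consumed again without re-analysis. */
    void reset() {
        cursor = 0;
    }
}
```

A caller would read tokens with next() until it returns null, call reset(), and read the same tokens again for the second pass.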

The new SpanScorer is obviously not as fast when dealing with PhraseQuerys and SpanQuerys. It takes about twice the time or less to correctly highlight these Query types when compared to just highlighting every term from the Span or Phrase Query.
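
The difference between "correctly highlighting" a phrase and "highlighting every term" can be shown with a small sketch. This is not how SpanScorer works internally (it relies on MemoryIndex and Spans); the class PhraseHighlightSketch and both method names are made up for the illustration, and tokens are plain Strings.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative only: contrasts term-level and position-sensitive hit selection. */
final class PhraseHighlightSketch {

    /** Naive: mark every position whose token is one of the phrase terms. */
    static Set<Integer> termPositions(List<String> tokens, List<String> phrase) {
        Set<String> terms = new HashSet<>(phrase);
        Set<Integer> hits = new TreeSet<>();
        for (int i = 0; i < tokens.size(); i++) {
            if (terms.contains(tokens.get(i))) {
                hits.add(i);
            }
        }
        return hits;
    }

    /** Position sensitive: mark only positions inside an exact, in-order phrase match. */
    static Set<Integer> phrasePositions(List<String> tokens, List<String> phrase) {
        Set<Integer> hits = new TreeSet<>();
        if (phrase.isEmpty()) {
            return hits;
        }
        for (int start = 0; start + phrase.size() <= tokens.size(); start++) {
            if (tokens.subList(start, start + phrase.size()).equals(phrase)) {
                for (int k = 0; k < phrase.size(); k++) {
                    hits.add(start + k);
                }
            }
        }
        return hits;
    }
}
```

For the tokens "the quick fox saw a quick brown fox" and the phrase "quick brown", termPositions marks positions 1, 5 and 6, while phrasePositions marks only 5 and 6.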

As far as my comment about using TokenSources to generate the TokenStream -- for short docs (5-15k) and a quick analyzer, it takes just as long to get TermVector info and sort it as to re-analyze the text. As docs get bigger, you will certainly start to see some benefits though.

Besides the re-analyzing, the reason the Highlighter is slow for very large documents is that every token from a doc is examined and scored and then the doc is rebuilt piece by piece from each token. This does not scale well <g> (especially in combination with re-analyzing text). While there is probably room for improvement here, I don't see the current approach ever reaching the speed of Ronnie Kolehmainen's Highlighter. However, I also don't see Ronnie's Highlighter ever being as flexible or able to cover as many use cases as the current contrib Highlighter approach.

Ronnie's approach works well for large docs because it only deals with the tokens of interest (tokens from the Query) rather than scoring every token. This is done using TermVectorOffsets and so immediately requires that you have turned those on. Further, I don't see this approach ever being easily extended to supporting Spans or PhraseQuerys without reimplementing special Lucene search logic. Those are the reasons that I chose not to work with Ronnie's Highlighter (particularly, my main motivation was Spans support). Also, to be able to ignore fields when Highlighting, Ronnie's Highlighter would likely have to try and grab TermVectorOffsets for every field in the index for each Term in the Query. If you have a lot of fields this could be much slower. Even still, it can't be argued that Ronnie's Highlighter is not dramatically faster than the current Highlighter on very large documents, especially if you're currently re-analyzing text. It is probably the perfect option for Otis's client if he has TermVectorOffsets on in his index or can easily re-index.
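
As a rough sketch of the offset-driven idea, the code below wraps only the character spans of query terms and copies everything else through untouched, so there is no per-token scoring step. It is not Ronnie's implementation: the class OffsetHighlightSketch, the Span type, and the <b> tags are assumptions for the example, and the spans stand in for whatever stored term vector offsets would supply.

```java
import java.util.List;

/** Illustrative only: highlights by character offsets instead of scoring each token. */
final class OffsetHighlightSketch {

    /** A character range [start, end) covering one query term occurrence. */
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Wraps each span in <b>...</b>; spans must be sorted by start and must not overlap. */
    static String highlight(String text, List<Span> spans) {
        StringBuilder out = new StringBuilder();
        int last = 0;
        for (Span s : spans) {
            out.append(text, last, s.start);                                // untouched text before the hit
            out.append("<b>").append(text, s.start, s.end).append("</b>");  // the hit itself
            last = s.end;
        }
        out.append(text, last, text.length());                              // trailing text after the last hit
        return out.toString();
    }
}
```

The loop runs once per hit rather than once per token in the document, which is where the advantage on very large documents would come from. It also shows the limitation mentioned above: the offsets alone say nothing about whether a term occurrence belongs to a matching phrase or span.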

I am currently trying to think of some possible hybrid approach to highlighting...

I have fixed a few things with my Highlighter and will be updating it soon. Also, I forgot to mention, there is the dependency on MemoryIndex.


- Mark



mark harwood wrote:

While we're considering highlighter performance there was some discussion of 
this around another implementation here: 
http://issues.apache.org/jira/browse/LUCENE-644

Ronnie Kolehmainen's implementation was proven faster than the current contrib 
highlighter but was almost certainly missing some of the features/support for 
edge cases.

There are certainly some optimisations in the existing implementation that should be 
possible. Not building StringBuffers for document fragments with no hits seems an obvious 
step. Whether this can be done while preserving the existing "helper" 
interfaces (Fragmenter/Scorer) remains to be seen.

Cheers,
Mark


----- Original Message ----
From: Mark Miller <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Thursday, 21 June, 2007 2:11:52 AM
Subject: Re: Highlighter that works with phrase and span queries
I will work up some performance numbers over the next day or two to share with you. I have spent the last day or two with a profiler trying to find the biggest performance drains.
Unfortunately, I will probably not be able to squeeze out much more performance than the current Highlighter. When I started working on this project I considered starting from scratch to create a better, more accurate Highlighter. After some initial work I quickly came to the realization that Mark Harwood (with some additions by others) had already solved too many corner cases and interesting needs. The few alternate Highlighters in JIRA did not meet the standards set by Mark's highlighter. Trying to replicate all that work in a different manner didn't seem like a fruitful approach -- Harwood is more clever than I <g>
Taking that into account, I decided to extend the Highlighter using the great framework that is already in place. I implemented a new Scorer that acts much like the default Scorer, but when it finds a Query clause that is position sensitive (PhraseQuery, SpanQuery), it creates a MemoryIndex that is used to extract the correct Spans for the Query (credit to Paul Elschot and Mark Harwood for the approach). Non position sensitive Query clauses are handled similarly to the way they were in the original highlighter's Scorer. This means that non position sensitive queries are likely the same speed as before, while position sensitive queries are likely a bit slower. For my uses, the thing is damned fast -- of course my uses involve small documents (newspaper articles).
I am very interested in making this thing as fast as possible though, so I will build some benchmark tests and try to squeeze as much performance out of the Highlighter as I can. I will also see if my Scorer is any faster than the original.
All that said, my guess is that one of the slowest parts of Highlighting is re-tokenizing the text. There is always the option of turning on TermVectors and using org.apache.lucene.search.highlight.TokenSources to get the TokenStream. Based on Mark H's comments, it may be twice as fast as re-tokenizing. This method can also be used with my new Highlighter code (which is more a plug-in to the old Highlighter than a replacement).
Considering that both of your comments immediately went to performance, I will certainly be spending some time working on this.
- Mark
Hi Mark,

I know one large user (meaning: high query/highlight rates) of the current 
Highlighter and this user wasn't too happy with its performance.  I don't know 
the details, other than it was inefficient.  So now I'm wondering if you've 
benchmarked your Highlighter against that/current Highlighter to see not only 
which one is more accurate, but also which one is faster, and by how much?

Thanks,
Otis
 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share
This is really great, Mark.  I'll look into integrating it with Solr,
as better phrase highlighting is a definite sore point for some of our
users.



Any indication on performance differences?



cheers,

-mike
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]