While we're considering highlighter performance, there was some discussion of it around another implementation here: http://issues.apache.org/jira/browse/LUCENE-644
Ronnie Kolehmainen's implementation was proven faster than the current contrib highlighter but was almost certainly missing some of the features/support for edge cases.

There are certainly some optimisations in the existing implementation that should be possible. Not building StringBuffers for document fragments with no hits seems an obvious step. Whether this can be done while preserving the existing "helper" interfaces (Fragmenter/Scorer) remains to be seen.

Cheers,
Mark

----- Original Message ----
From: Mark Miller <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Thursday, 21 June, 2007 2:11:52 AM
Subject: Re: Highlighter that works with phrase and span queries

I will work up some performance numbers over the next day or two to share with you. I have spent the last day or two with a profiler trying to find the biggest performance drains. Unfortunately, I will probably not be able to squeeze out much more performance than the current Highlighter.

When I started working on this project I considered starting from scratch to create a better, more accurate Highlighter. After some initial work I quickly came to the realization that Mark Harwood (with some additions by others) had already solved too many corner cases and interesting needs. The few alternate Highlighters in JIRA did not meet the standards set by Mark's highlighter. Trying to replicate all that work in a different manner didn't seem like a fruitful approach -- Harwood is more clever than I <g>

Taking that into account, I decided to extend the Highlighter using the great framework that is already in place. I implemented a new Scorer that acts much like the default Scorer, but when it finds a Query clause that is position sensitive (PhraseQuery, SpanQuery), it creates a MemoryIndex that is used to extract the correct Spans for the Query (credit to Paul Elschot and Mark Harwood for the approach).
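
To make the position-sensitive idea concrete, here is a minimal stand-alone sketch in plain Java. It is not the Scorer described above and does not use Lucene, MemoryIndex, or the Spans API; the class and method names (PhraseSpanSketch, findPhraseSpans, highlight) are hypothetical, and it assumes the text has already been tokenized into a list of strings. It marks only the tokens that fall inside an exact phrase match, rather than every occurrence of each query term:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration only; not part of Lucene's highlighter. */
final class PhraseSpanSketch {

    /**
     * Returns [start, end) token ranges where the phrase occurs as
     * consecutive tokens. Tokens are compared case-insensitively.
     */
    static List<int[]> findPhraseSpans(List<String> tokens, List<String> phrase) {
        List<int[]> spans = new ArrayList<>();
        if (phrase.isEmpty()) {
            return spans;
        }
        for (int start = 0; start + phrase.size() <= tokens.size(); start++) {
            boolean match = true;
            for (int i = 0; i < phrase.size(); i++) {
                if (!tokens.get(start + i).equalsIgnoreCase(phrase.get(i))) {
                    match = false;
                    break;
                }
            }
            if (match) {
                spans.add(new int[] {start, start + phrase.size()});
            }
        }
        return spans;
    }

    /** Wraps in <B> tags only the tokens that lie inside a phrase span. */
    static String highlight(List<String> tokens, List<String> phrase) {
        boolean[] hit = new boolean[tokens.size()];
        for (int[] span : findPhraseSpans(tokens, phrase)) {
            Arrays.fill(hit, span[0], span[1], true);
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < tokens.size(); i++) {
            if (i > 0) {
                out.append(' ');
            }
            out.append(hit[i] ? "<B>" + tokens.get(i) + "</B>" : tokens.get(i));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> tokens =
                Arrays.asList("new", "york", "is", "not", "a", "new", "idea");
        // Only the first "new" is part of the phrase "new york".
        System.out.println(highlight(tokens, Arrays.asList("new", "york")));
    }
}
```

A term-only scorer would mark both occurrences of "new" in that example; restricting hits to the matching span is what a phrase-aware scorer adds, at the cost of the extra position lookup.
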
Non position sensitive Query clauses are handled similarly to the way they were in the original highlighter's Scorer. This means that non position sensitive queries are likely the same speed as before, while position sensitive queries are likely a bit slower.

For my uses, the thing is damned fast -- of course my uses involve small documents (newspaper articles). I am very interested in making this thing as fast as possible though, so I will build some benchmark tests and try to squeeze as much performance out of the Highlighter as I can. I will also see if my Scorer is any faster than the original.

All that said, my guess is that one of the slowest parts of highlighting is re-tokenizing the text. There is always the option of turning on TermVectors and using org.apache.lucene.search.highlight.TokenSources to get the TokenStream. Based on Mark H's comments, it may be twice as fast as re-tokenizing. This method can also be used with my new Highlighter code (which is more a plug-in to the old Highlighter than a replacement).

Considering that both of your comments immediately went to performance, I will certainly be spending some time working on this.

- Mark

> Hi Mark,
>
> I know one large user (meaning: high query/highlight rates) of the current
> Highlighter and this user wasn't too happy with its performance. I don't
> know the details, other than it was inefficient. So now I'm wondering if
> you've benchmarked your Highlighter against that/current Highlighter to see
> not only which one is more accurate, but also which one is faster, and by how
> much?
>
> Thanks,
> Otis
> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
> Simpy -- http://www.simpy.com/ - Tag - Search - Share
>
> > This is really great, Mark. I'll look into integrating it with Solr,
> > as better phrase highlighting is a definite sore point for some of our
> > users.
> >
> > Any indication on performance differences?
> >
> > cheers,
> > -mike

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]