I will work up some performance numbers over the next day or two to
share with you. I have spent the last day or two with a profiler trying
to find the biggest performance drains.
Unfortunately, I will probably not be able to squeeze out much more
performance than the current Highlighter. When I started working on this
project I considered starting from scratch to create a better, more
accurate Highlighter. After some initial work I quickly came to the
realization that Mark Harwood (with some additions by others) had
already solved too many corner cases and interesting needs. The few
alternate Highlighters in JIRA did not meet the standards set by Mark's
highlighter. Trying to replicate all that work in a different manner
didn't seem like a fruitful approach -- Harwood is more clever than I <g>
Taking that into account, I decided to extend the Highlighter using the
great framework that is already in place. I implemented a new Scorer
that acts much like the default Scorer, but when it finds a Query clause
that is position sensitive (PhraseQuery, SpanQuery), it creates a
MemoryIndex that is used extract the correct Spans for the Query (Credit
to Paul Elschot and Mark Harwood for the approach). Non position
sensitive Query claueses are handled similar to the way they where in
the original highlighter's Scorer. This means that non position
sensitive queries are likely the same speed as before, while position
sensitive queries are likely a bit slower. For my uses, the thing is
damned fast -- of course my uses involves small documents (Newspaper
articles).
I am very interested in making this thing as fast as possible though, so
I will build some benchmark tests and try to squeeze as much performance
out of the Highligher as I can. I will also see if my Scorer is any
faster than the original.
All that said, my guess is that one of the slowest parts of Highlighting
is re-tokenizing the text. There is always the option of turning on
TermVectors and using org.apache.lucene.search.highlight.TokenSources to
get the TokenStream. Based on Mark H's comments, it may be twice as fast
as re-tokenizing. This method can also be used with my new Highlighter
code as well (which is more a plug-in to the old Highlighter than a
replacement.)
Considering that both of your comments immediately went to performance,
I will certainly be spending some time working on this.
- Mark
Hi Mark,
I know one large user (meaning: high query/highlight rates) of the current
Highlighter and this user wasn't too happy with its performance. I don't know
the details, other than it was inefficient. So now I'm wondering if you've
benchmarked your Highlighter against that/current Highlighter to see not only
which one is more accurate, but also which one is faster, and by how much?
Thanks,
Otis
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/ - Tag - Search - Share
This is really great, Mark. I'll look into integrating it with Solr,
as better phrase highlighting is a definite sore point for some of our
users.
Any indication on performance differences?
cheers,
-mike
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]