Hi Daniel/Chris,
> Unfortunately, the contrib/highlighter code in source control fails to
> meet our needs in two ways:
> 1. We don't just want fragments, we want *all* of the text, with
> highlights in the appropriate places (although we do offer a means
> to display just the fragments as well), and

Pass a "NullFragmenter" to the highlighter constructor to turn off
fragmentation.

> 2. We don't deal with HTML, just plain text on a Swing text component.
> In other words we don't have to "format" or modify the text at all,
> just tell the Swing component which bits need to be highlighted.

Swing supports HTML and will do the highlighting for you:
String swingText = "<html>" + highlighter.getBestFragment(tokenStream, text) + "</html>";
If you don't like that approach and really do just want to know the
positions, plug in your own "Formatter" class which, instead of marking
up the text, silently records the hit position information provided to
it in the "TokenGroup" class and then returns the original string
without adding any markup. TokenGroup handles the issue of identifying
runs of overlapping tokens for you.
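
A rough sketch of the recording idea, in case it helps. To keep it
self-contained I have used stand-in types: "HitGroup" and "HitFormatter"
are hypothetical names that only mimic the shape of the contrib
"TokenGroup" and "Formatter" (offsets plus a score), so adapt the
signatures to whatever your copy of the highlighter actually exposes.
The Swing calls are the standard javax.swing.text ones.

```java
import java.awt.Color;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.BadLocationException;
import javax.swing.text.DefaultHighlighter;
import javax.swing.text.JTextComponent;

// Hypothetical stand-in for the hit information passed to a formatter.
// This is NOT the contrib TokenGroup; it only models offsets and a score.
final class HitGroup {
    final int startOffset;
    final int endOffset;
    final float totalScore;

    HitGroup(int startOffset, int endOffset, float totalScore) {
        this.startOffset = startOffset;
        this.endOffset = endOffset;
        this.totalScore = totalScore;
    }
}

// Hypothetical stand-in for the formatter callback.
interface HitFormatter {
    String highlightTerm(String originalText, HitGroup group);
}

// Records where the hits are and hands the text back untouched.
final class RecordingFormatter implements HitFormatter {
    private final List<int[]> hits = new ArrayList<>();

    @Override
    public String highlightTerm(String originalText, HitGroup group) {
        // Assumption: a score above zero means the group contains a query term.
        if (group.totalScore > 0) {
            hits.add(new int[] {group.startOffset, group.endOffset});
        }
        return originalText; // no markup added
    }

    List<int[]> getHits() {
        return hits;
    }

    // Paints every recorded range on a Swing text component.
    void applyTo(JTextComponent component) throws BadLocationException {
        DefaultHighlighter.DefaultHighlightPainter painter =
                new DefaultHighlighter.DefaultHighlightPainter(Color.YELLOW);
        for (int[] hit : hits) {
            component.getHighlighter().addHighlight(hit[0], hit[1], painter);
        }
    }
}
```

Run the highlighter over the full text with a formatter like this
plugged in, throw away the returned string, and call applyTo() on your
text component. The offsets only line up if the component holds exactly
the text that was tokenized.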
Hoss, your pseudocode looked like a solution for identifying phrase
queries. Lack of proper support for phrase queries is a known issue
with the current highlighter but I thought the primary issue in question
here was speed? The approach taken by the current highlighter is to
maintain a HashSet of all unique query terms and check each token in the
text's token stream for a hit on this set. As your code suggests, this
could be made faster if there were multiple queries all of which were
PhraseQueries (with no slop factor!) because you would only need to
check each phrase's "first terms" initially. Not sure this helps for
non-phrase queries. Also, I don't think hitting the index to work out
which terms matched the doc in question, in order to shorten the list
of terms to highlight, is likely to speed things up. If anything, the
extra disk IO is likely to slow it down.
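
To make the term-set check concrete, it amounts to roughly the
following. This is a simplified sketch and not the highlighter's actual
code: "TermSetMatcher" is a made-up name, and a regex word splitter with
lower-casing stands in for a real Analyzer and token stream.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TermSetMatcher {
    // Crude tokenizer: runs of word characters. A real Analyzer would differ.
    private static final Pattern WORD = Pattern.compile("\\w+");

    // Returns {start, end} character offsets of every token that is a query term.
    static List<int[]> findHits(String text, Set<String> queryTerms) {
        List<int[]> hits = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            String token = m.group().toLowerCase(Locale.ROOT);
            // One hash lookup per token, however many query terms there are.
            if (queryTerms.contains(token)) {
                hits.add(new int[] {m.start(), m.end()});
            }
        }
        return hits;
    }
}
```

The per-token cost is a single set lookup, so the work is dominated by
tokenizing the text rather than by the number of query terms.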
With regard to the question of overlapping tokens - the highlighter is
robust when marking these up.
Cheers
Mark