Re: Highlighting text for queries with huge numbers of terms

Daniel Noll Sun, 19 Feb 2006 15:31:57 -0800

markharw00d wrote:

Swing supports HTML and will do the highlight for you.
SwingText="<html>"+highlighter.getBestFragment(tokenStream,text)+"</html>";
If you don't like that approach and really do just want to know the positions, plug in your own "Formatter" class which, instead of marking up the text, silently records the hit position information provided to it in the "TokenGroup" class and then returns the original string without adding any markup. TokenGroup handles the issue of identifying runs of overlapping tokens for you.
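
A minimal sketch of the position-recording formatter described above. To stay self-contained it does not import Lucene: HitFormatter and HitGroup are hypothetical stand-ins for the highlighter's Formatter and TokenGroup, and the offset and score accessors are assumptions about what the group exposes, so check them against the highlighter version you actually use.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for the highlighter's TokenGroup: one run of
// overlapping tokens, with character offsets into the original text.
interface HitGroup {
    int startOffset();
    int endOffset();
    float totalScore(); // assumed to be > 0 when the run contains a query term
}

// Hypothetical stand-in for the highlighter's Formatter callback.
interface HitFormatter {
    String highlightTerm(String originalText, HitGroup group);
}

// Records where the hits are instead of adding markup.
final class PositionRecordingFormatter implements HitFormatter {
    private final List<int[]> hits = new ArrayList<>();

    @Override
    public String highlightTerm(String originalText, HitGroup group) {
        if (group.totalScore() > 0) {
            // Remember the {start, end} offsets of this hit.
            hits.add(new int[] {group.startOffset(), group.endOffset()});
        }
        return originalText; // hand the text back untouched
    }

    // Read-only view of the recorded {start, end} pairs, in callback order.
    List<int[]> hits() {
        return Collections.unmodifiableList(hits);
    }
}
```

After the highlighter has run, hits() holds the offsets and the returned fragment can be discarded.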

Swing's HTML renderer is unfortunately too slow for our use (it took something like 10 seconds to load and display a 100kB document with highlights.) It's pretty ugly, too. Maybe that will change in version 6.0, though.

The plain text renderer has the distinct advantage of being relatively fast at that size, and the highlighting can also be done after the text is displayed, even in the background, which is a huge benefit.

The way I would be able to use the existing highlighter would probably be to make a custom Formatter which takes a Swing Highlighter object and does the highlighting from there, and then run the highlighter in a new thread after the text is already displayed.
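
A rough sketch of that arrangement, using only standard Swing classes. The class and method names are made up for illustration, and findHits is a naive case-insensitive substring scan standing in for whatever actually produces the offsets (for example, a Formatter that records positions); it is not how the Lucene highlighter finds terms.

```java
import java.awt.Color;
import java.util.ArrayList;
import java.util.List;
import javax.swing.SwingUtilities;
import javax.swing.text.BadLocationException;
import javax.swing.text.DefaultHighlighter;
import javax.swing.text.Highlighter;
import javax.swing.text.JTextComponent;

final class BackgroundHighlighting {
    // Pure scan with no Swing calls, so it is safe to run off the event thread.
    // Returns {start, end} offsets of each non-overlapping match of term.
    static List<int[]> findHits(String text, String term) {
        List<int[]> hits = new ArrayList<>();
        int len = term.length();
        if (len == 0) {
            return hits;
        }
        for (int i = 0; i + len <= text.length(); i++) {
            if (text.regionMatches(true, i, term, 0, len)) {
                hits.add(new int[] {i, i + len});
                i += len - 1; // continue after this match
            }
        }
        return hits;
    }

    // Touches the component, so it must run on the event dispatch thread.
    static void apply(JTextComponent view, List<int[]> hits) {
        Highlighter.HighlightPainter painter =
                new DefaultHighlighter.DefaultHighlightPainter(Color.YELLOW);
        Highlighter highlighter = view.getHighlighter();
        for (int[] hit : hits) {
            try {
                highlighter.addHighlight(hit[0], hit[1], painter);
            } catch (BadLocationException e) {
                // The text changed since the scan; skip the stale offset.
            }
        }
    }

    // Call from the event dispatch thread once the text is already showing.
    static Thread highlightInBackground(JTextComponent view, String term) {
        String snapshot = view.getText();
        Thread worker = new Thread(() -> {
            List<int[]> hits = findHits(snapshot, term);
            SwingUtilities.invokeLater(() -> apply(view, hits));
        }, "highlighter");
        worker.setDaemon(true);
        worker.start();
        return worker;
    }
}
```

The scan works on a snapshot of the text, so the component stays responsive while it runs; only the final addHighlight calls go back to the event thread.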


But...

Hoss, your pseudo code looked like a solution for identifying phrase queries. Lack of proper support for phrase queries is a known issue with the current highlighter but I thought the primary issue in question here was speed?

Actually, we do need support for phrase queries (which pretty much rules out the existing highlighter code) but slop isn't as important.

Not sure this helps for non-phrase queries.

Indeed, the speed will be roughly identical for non-phrase queries since lookups in a HashMap versus a HashSet would be pretty much identical.
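
To illustrate why the two cost about the same, here is a small sketch with invented names. Each token costs one hash probe either way (java.util.HashSet is itself backed by a HashMap); the only difference is that the map can carry extra data per term. The List<String> value type for phrase data is an assumption made for this example, not something taken from the pseudo code discussed above.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

final class TermLookup {
    // Plain term highlighting: is this token one of the query terms?
    static int countHitsWithSet(List<String> tokens, Set<String> terms) {
        int hits = 0;
        for (String token : tokens) {
            if (terms.contains(token)) {
                hits++;
            }
        }
        return hits;
    }

    // Phrase-aware variant: the same single probe per token, but the value
    // can hold the phrases the term belongs to (hypothetical representation).
    static int countHitsWithMap(List<String> tokens,
                                Map<String, List<String>> phrasesByTerm) {
        int hits = 0;
        for (String token : tokens) {
            if (phrasesByTerm.get(token) != null) {
                hits++;
            }
        }
        return hits;
    }
}
```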

Also, I don't think hitting the index to work out what terms were a hit for the doc in question in order to shorten the list of terms to highlight is likely to speed up things. If anything, the extra disk IO is likely to slow it down.


That's a good point, particularly in the case of small documents.

We actually limit ourselves to 100k display at the moment because even JTextArea gets pretty inefficient once you get up to 1MB text sizes. The nasty part is that it can't load the text itself in the background -- the text component remains completely white until the entire text is present. A custom Document implementation may be able to work around some of that, but the last time I tried it, there was some flicker each time more text was appended. A completely custom component is probably the way to go. :-)


Daniel


--
Daniel Noll

Nuix Australia Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW 2007, Australia
Phone: (02) 9280 0699
Fax:   (02) 9212 6902

