RE: Highlighting html pages

Scott Smith Mon, 05 Nov 2012 16:07:13 -0800

Since no one answered this, I decided I'd answer it myself (in case anyone else 
wanted the answer).

First, there are two types of filters you can use in an Analyzer: character
filters and token filters.  Character filters are applied before tokenization
and token filters are applied after tokenization.

So, my question was really nonsensical.  The HTMLStripCharFilter is a character 
filter and therefore gets applied to the html data before it goes to the 
tokenizer.  You can then apply any tokenizer you wish (including 
StandardTokenizer).
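
To make the ordering concrete, here is a minimal plain-Java sketch of the two stages. It is an illustration only and does not use Lucene: the class and method names (StripThenTokenize, analyze, Token) are made up for this example, the tag stripping is far cruder than what HTMLStripCharFilter does, and the "tokenizer" simply splits on anything that is not a letter or digit. The point is just that markup is removed at the character level first, the tokenizer never sees angle brackets, and each token can still carry offsets into the original html.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is NOT Lucene's HTMLStripCharFilter.
class StripThenTokenize {

    // A token plus its start/end offsets in the ORIGINAL (unstripped) input.
    static final class Token {
        final String text;
        final int start;
        final int end;

        Token(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return text + "[" + start + "," + end + ")";
        }
    }

    static List<Token> analyze(String html) {
        // Stage 1 (character filter): drop everything between '<' and '>',
        // remembering where each surviving character came from.
        StringBuilder stripped = new StringBuilder();
        List<Integer> origin = new ArrayList<>();
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;
            } else if (c == '>' && inTag) {
                inTag = false;
            } else if (!inTag) {
                stripped.append(c);
                origin.add(i);
            }
        }

        // Stage 2 (tokenizer): runs of letters/digits in the stripped text.
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < stripped.length()) {
            if (!Character.isLetterOrDigit(stripped.charAt(i))) {
                i++;
                continue;
            }
            int j = i;
            while (j < stripped.length() && Character.isLetterOrDigit(stripped.charAt(j))) {
                j++;
            }
            // Assumption of this sketch: the end offset is one past the
            // original position of the token's last character.
            tokens.add(new Token(stripped.substring(i, j), origin.get(i), origin.get(j - 1) + 1));
            i = j;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Now is <span class=\"underline\">the time</span> for all"));
    }
}
```

Note that words inside the tag itself (such as the attribute value "underline") never reach the tokenizer, so they cannot be matched or highlighted.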

There is one caveat you might want to be aware of when using the
HTMLStripCharFilter and then highlighting search terms.  Assume you strip the
html markup with the HTMLStripCharFilter and then use the standard
tokenizer.  Now you run the original html through the highlighter.  If it
contains other html tags (besides whatever you are using for highlighting -
<b> by default), then you can have cases where your tags won't be properly
nested.

For example you could end up with:

        Now is <span class="underline">the <b>time</span></b> for all good men to come...

Note that the <b> element isn't properly nested inside the span: its closing
tag comes after the </span>.  For straight html, I would assume the browser
will work it out.  However, if you are using xml, the document will no longer
be well-formed.  The problem is that the highlighter appears to place the
ending tag (the </b>) just before the next word after the highlighted term
instead of immediately after the marked word ("time").  This means that if
the HTMLStripCharFilter removed any html tags between those two points, the
closing </b> will come after those tags instead of before them.
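
The following plain-Java sketch reproduces the symptom so the two placements can be compared side by side. It is only a model of what I observed, not Lucene's code: EndOffsetDemo and its highlight method are hypothetical names, the stripping is a crude stand-in for HTMLStripCharFilter, and the word is located with a simple substring search. With closeBeforeNextChar set to true the closing tag is placed just before the next surviving character (which gives the badly nested result above); with false it is placed right after the word.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; it mimics the symptom described above, not Lucene's code.
class EndOffsetDemo {

    // Wraps the first occurrence of `word` (located in the tag-stripped text)
    // in <b>...</b> inside the original html.
    static String highlight(String html, String word, boolean closeBeforeNextChar) {
        // Strip tags, remembering the original index of each surviving character.
        StringBuilder stripped = new StringBuilder();
        List<Integer> origin = new ArrayList<>();
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;
            } else if (c == '>' && inTag) {
                inTag = false;
            } else if (!inTag) {
                stripped.append(c);
                origin.add(i);
            }
        }

        int s = stripped.indexOf(word);
        if (s < 0) {
            return html; // nothing to highlight
        }
        int e = s + word.length(); // end offset in the stripped text

        int start = origin.get(s);
        // Policy A: close immediately after the last character of the word.
        int endAfterWord = origin.get(e - 1) + 1;
        // Policy B: close just before the next surviving character, which
        // lands after any tags that were stripped in between.
        int endBeforeNext = e < origin.size() ? origin.get(e) : html.length();

        int end = closeBeforeNextChar ? endBeforeNext : endAfterWord;
        return html.substring(0, start) + "<b>" + html.substring(start, end)
                + "</b>" + html.substring(end);
    }

    public static void main(String[] args) {
        String html = "Now is <span class=\"underline\">the time</span> for all good men to come...";
        System.out.println(highlight(html, "time", true));
        System.out.println(highlight(html, "time", false));
    }
}
```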

Admittedly, you can make up cases where the highlighter will get it right, but
it appears to me that that only happens with phrases.  For single words (the
more likely case), the closing highlighting sequence (</b>) should come
immediately after the highlighted word.  Regardless, it's impossible for the
highlighter to get it right all the time, and you may have to write code that
goes in and fixes things up if you're using xml or you're really particular
about tags being properly nested.
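
As a rough idea of what such fix-up code might look like, here is a small plain-Java sketch. Everything in it is an assumption on my part: the class and method names are made up, it assumes the highlight tag is <b>, it assumes XML-style input (every opened element is closed, and no '>' inside attribute values), and it only handles the one pattern shown above, where closing tags sit directly between the highlighted word and the </b>. It checks nesting with a stack and only keeps the rewritten text if the result is actually well nested.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical post-processing sketch; not part of Lucene.
class HighlightFixup {

    // Any tag: optional leading '/', element name, attributes, optional trailing '/'.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*?(/?)>");

    // One or more closing tags (other than </b>) immediately followed by </b>.
    private static final Pattern MISPLACED_CLOSE =
            Pattern.compile("((?:</(?!b>)[A-Za-z][^>]*>)+)</b>");

    // True if every closing tag matches the most recently opened element.
    static boolean isWellNested(String markup) {
        Deque<String> open = new ArrayDeque<>();
        Matcher m = TAG.matcher(markup);
        while (m.find()) {
            String name = m.group(2);
            if (!m.group(1).isEmpty()) {
                // Closing tag: must match the innermost open element.
                if (open.isEmpty() || !open.pop().equals(name)) {
                    return false;
                }
            } else if (m.group(3).isEmpty()) {
                // Opening tag that is not self-closing.
                open.push(name);
            }
        }
        return open.isEmpty();
    }

    // Moves a </b> in front of the closing tags that directly precede it,
    // but only when the input is badly nested and the result is not.
    static String fix(String markup) {
        if (isWellNested(markup)) {
            return markup;
        }
        String moved = MISPLACED_CLOSE.matcher(markup).replaceAll("</b>$1");
        return isWellNested(moved) ? moved : markup;
    }

    public static void main(String[] args) {
        String bad = "Now is <span class=\"underline\">the <b>time</span></b> for all good men to come...";
        System.out.println(fix(bad));
    }
}
```

Anything this sketch cannot repair is returned unchanged, so you would still need to decide what to do with those documents.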

Cheers

Scott

-----Original Message-----
From: Scott Smith [mailto:ssm...@mainstreamdata.com] 
Sent: Thursday, November 01, 2012 7:16 PM
To: Michael Sokolov; java-user@lucene.apache.org
Subject: RE: Highlighting html pages

I was trying to play with this.  Am I correct in assuming that this isn't going
to work with the StandardTokenizer (since it appears to strip angle brackets
among other things)?  Does HTMLStripCharFilter expect a WhitespaceTokenizer or
a CharacterTokenizer or ??

If I want to get rid of punctuation (commas, periods, semicolons, etc.) after 
the HTML stripping, is there a filter?  Essentially, I want to get it back to 
what StandardTokenizer would give me after I've stripped the HTML.

Suggestions?

Scott

-----Original Message-----
From: Michael Sokolov [mailto:soko...@ifactory.com] 
Sent: Tuesday, October 23, 2012 9:04 PM
To: java-user@lucene.apache.org
Cc: Scott Smith
Subject: Re: Highlighting html pages

If you use HTMLStripCharFilter, it extracts the text only, leaving tags out, 
and remembering the word positions so that highlighting works properly.  Should 
do exactly what you want out of the box...

On 10/23/2012 8:00 PM, Scott Smith wrote:
> I need to take an html page that I retrieve from my lucene search and
> highlight all of the terms that are part of the search.  I need to skip over 
> any html tags since I don't want any words in tags which happen to match the 
> search to be highlighted.
>
> Note that I don't want sections of the document.  I need to highlight all 
> terms in the document (with a <span> or something similar) and get back the 
> entire document (with the new <span>s) so it can be displayed in its entirety 
> with the search terms highlighted.
>
> Last time I did this (in the days of 1.4.2 - so a while ago), I had to write 
> a custom tokenizer that skipped over the html tokens so that I didn't 
> accidentally highlight them.  I'm hoping that there is an easier way to do 
> this now.
>
> Suggestions?
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
