Since no one answered this, I decided I'd answer it myself (in case anyone else wanted the answer).
First, there are two types of filters you can use in an Analyzer: character filters and token filters. Character filters are applied before tokenization and token filters are applied after it. So my question didn't really make sense. HTMLStripCharFilter is a character filter, so it is applied to the html data before that data reaches the tokenizer. You can then use any tokenizer you wish (including StandardTokenizer).

There is one caveat you might want to be aware of when using the HTMLStripCharFilter and then highlighting search terms. Assume you strip the html with the HTMLStripCharFilter and then use the standard tokenizer. Now you run the result through the highlighter. If the text contains other html tags (besides whatever you are using for highlighting, <b> by default), you can have cases where your tags won't be properly nested. For example, you could end up with:

Now is <span class="underline">the <b>time</span></b> for all good men to come...

Note that the <b>...</b> pair isn't properly nested inside the span: the <b> opens inside the span, but the </b> closes after the </span>. For straight html, I would assume the browser will work it out. However, if you are using xml, the document will become invalid.

The problem is that the html highlight code appears to place the ending tag (the </b>) just before the next word after the highlighted term instead of directly after the marked word ("time"). This means that if the HTMLStripCharFilter eliminated any html tags between those two words, the closing </b> will come after those tags instead of before them.

Admittedly, you can make up cases where the highlighter will get it right, but it appears to me that that only happens with phrases. For single words (the more likely case), the closing highlighting sequence (</b>) should come directly after the highlighted word. Regardless, it's impossible for the highlighter to get it right all the time, and you may have to write code that goes in and fixes things up if you're using xml or you're really particular about tags being properly nested.
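To make the nesting point concrete, here is a minimal plain-Java sketch of the placement argued for above: the closing tag goes at the end offset of the matched word itself, so a tag that follows the word (such as </span>) stays outside the highlight. This is not Lucene code and not how the Lucene highlighter is implemented. HighlightSketch, Span, findOutsideTags and wrap are hypothetical names made up for this illustration, and the tag scanner is a deliberately naive stand-in for HTMLStripCharFilter (it assumes no comments, scripts, entities, or ">" inside attribute values).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; not part of Lucene. */
final class HighlightSketch {

    /** [start, end) character offsets of one matched word in the ORIGINAL html. */
    static final class Span {
        final int start;
        final int end;
        Span(int start, int end) { this.start = start; this.end = end; }
    }

    /** Finds whole-word, case-insensitive matches of term that lie outside html tags. */
    static List<Span> findOutsideTags(String html, String term) {
        List<Span> spans = new ArrayList<>();
        boolean inTag = false;
        int i = 0;
        while (i < html.length()) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;               // naive assumption: every '<' starts a tag
                i++;
            } else if (c == '>') {
                inTag = false;
                i++;
            } else if (!inTag && Character.isLetterOrDigit(c)) {
                int start = i;              // scan one word; it ends at the first non-letter/digit
                while (i < html.length() && Character.isLetterOrDigit(html.charAt(i))) {
                    i++;
                }
                if (html.substring(start, i).equalsIgnoreCase(term)) {
                    spans.add(new Span(start, i));
                }
            } else {
                i++;
            }
        }
        return spans;
    }

    /** Wraps each span, closing the highlight at the word's own end offset. */
    static String wrap(String html, List<Span> spans, String open, String close) {
        StringBuilder out = new StringBuilder();
        int last = 0;
        for (Span s : spans) {              // spans are ascending and non-overlapping
            out.append(html, last, s.start).append(open)
               .append(html, s.start, s.end).append(close);
            last = s.end;
        }
        return out.append(html, last, html.length()).toString();
    }

    public static void main(String[] args) {
        String html = "Now is <span class=\"underline\">the time</span> for all good men";
        // prints: Now is <span class="underline">the <b>time</b></span> for all good men
        System.out.println(wrap(html, findOutsideTags(html, "time"), "<b>", "</b>"));
    }
}
```

Because each span covers exactly one word of text, the inserted <b>...</b> pair never straddles an existing tag. A phrase match that crosses a tag boundary would still need the highlight split into one pair per word, which this sketch does not attempt.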
Cheers

Scott

-----Original Message-----
From: Scott Smith [mailto:ssm...@mainstreamdata.com]
Sent: Thursday, November 01, 2012 7:16 PM
To: Michael Sokolov; java-user@lucene.apache.org
Subject: RE: Highlighting html pages

I was trying to play with this. Am I correct in assuming that this isn't going to work with the StandardTokenizer (since it appears to strip angle brackets among other things)? Does HTMLStripCharFilter expect a WhiteSpaceTokenizer or a CharacterTokenizer or ??

If I want to get rid of punctuation (commas, periods, semicolons, etc.) after the HTML stripping, is there a filter? Essentially, I want to get it back to what StandardTokenizer would give me after I've stripped the HTML.

Suggestions?

Scott

-----Original Message-----
From: Michael Sokolov [mailto:soko...@ifactory.com]
Sent: Tuesday, October 23, 2012 9:04 PM
To: java-user@lucene.apache.org
Cc: Scott Smith
Subject: Re: Highlighting html pages

If you use HTMLStripCharFilter, it extracts the text only, leaving tags out, and remembering the word positions so that highlighting works properly. Should do exactly what you want out of the box...

On 10/23/2012 8:00 PM, Scott Smith wrote:
> I need to take an html page that I retrieve from my lucene search and
> highlight all of the terms that are part of the search. I need to skip over
> any html tags since I don't want any words in tags which happen to match the
> search to be highlighted.
>
> Note that I don't want sections of the document. I need to highlight all
> terms in the document (with a <span> or something similar) and get back the
> entire document (with the new <span>s) so it can be displayed in its entirety
> with the search terms highlighted.
>
> Last time I did this (in the days of 1.4.2 - so a while ago), I had to write
> a custom tokenizer that skipped over the html tokens so that I didn't
> accidentally highlight them. I'm hoping that there is an easier way to do
> this now.
>
> Suggestions?
>
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org