Re: Highlighting html pages

Michael Sokolov Mon, 05 Nov 2012 05:21:39 -0800

HTMLStripCharFilter runs first, before any tokenizer: it strips all the tags 
and leaves all your text intact.  If you have angle brackets in the text (i.e., 
not tags), they will be left as is.  All your other analysis code should work 
just the same as if the text came from a plain text file.  Which tokenizer you 
want to use is up to you and has nothing to do with the CharFilter.
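
To make the ordering concrete, here is a rough plain-Java sketch of the same 
two stages.  It does not use Lucene at all: stripTags and tokenize are 
hypothetical stand-ins for the char filter and the tokenizer, the rule for 
what counts as a tag is a simplification, and replacing each tag with a space 
is a choice made for this sketch rather than a statement about what 
HTMLStripCharFilter emits.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only; none of this is Lucene code.
final class StripThenTokenizeSketch {

    // Assumption: '<' opens a tag only when followed by a letter, '/' or '!'.
    static boolean looksLikeTag(String s, int lt) {
        if (lt + 1 >= s.length()) return false;
        char next = s.charAt(lt + 1);
        return Character.isLetter(next) || next == '/' || next == '!';
    }

    // Stage 1 (stand-in for the char filter): drop tags, keep all other text.
    static String stripTags(String html) {
        StringBuilder text = new StringBuilder();
        int i = 0;
        while (i < html.length()) {
            char c = html.charAt(i);
            if (c == '<' && looksLikeTag(html, i)) {
                int close = html.indexOf('>', i);
                if (close < 0) break;      // unterminated tag: drop the rest
                text.append(' ');          // keep neighbouring words apart
                i = close + 1;
            } else {
                text.append(c);            // a stray '<' is ordinary text
                i++;
            }
        }
        return text.toString();
    }

    // Stage 2 (stand-in for the tokenizer): runs of letters/digits become
    // tokens, so punctuation is dropped here, not by the tag stripper.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            if (Character.isLetterOrDigit(text.charAt(i))) {
                int start = i;
                while (i < text.length()
                        && Character.isLetterOrDigit(text.charAt(i))) {
                    i++;
                }
                tokens.add(text.substring(start, i));
            } else {
                i++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = stripTags("<b>Hello</b>, world; 1 < 2");
        System.out.println(tokenize(text)); // prints [Hello, world, 1, 2]
    }
}
```

The point is only that the second stage never sees a tag, so it behaves 
exactly as it would on plain text, and any punctuation handling is the 
tokenizer's job.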


-Mike


On 11/1/2012 9:16 PM, Scott Smith wrote:

I was trying to play with this.  Am I correct in assuming that this isn't going 
to work with the StandardTokenizer (since it appears to strip angle brackets 
among other things)?  Does HTMLStripCharFilter expect a WhitespaceTokenizer or 
a CharTokenizer or ??

If I want to get rid of punctuation (commas, periods, semicolons, etc.) after 
the HTML stripping, is there a filter?  Essentially, I want to get it back to 
what StandardTokenizer would give me after I've stripped the HTML.

Suggestions?

Scott

-----Original Message-----
From: Michael Sokolov [mailto:soko...@ifactory.com]
Sent: Tuesday, October 23, 2012 9:04 PM
To: java-user@lucene.apache.org
Cc: Scott Smith
Subject: Re: Highlighting html pages

If you use HTMLStripCharFilter, it extracts the text only, leaving tags out, 
and remembering the word positions so that highlighting works properly.  Should 
do exactly what you want out of the box...
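
As a rough illustration of why keeping positions matters, here is a plain-Java 
sketch that finds words outside tags, records their offsets in the original 
HTML, and then wraps them in <span> elements so the whole document comes back 
intact.  It is not Lucene's implementation and does not call Lucene; the class 
and method names, the "hit" CSS class, and the simple rule for what counts as 
a tag are all assumptions made for the example.  It also does not decode 
entities or treat <script>/<style> bodies specially, and a '>' inside an 
attribute value would end the tag early.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java illustration only; none of this is Lucene code.
final class HtmlHighlightSketch {

    // Assumption: '<' opens a tag only when followed by a letter, '/' or '!'.
    static boolean startsTag(String html, int lt) {
        if (lt + 1 >= html.length()) return false;
        char next = html.charAt(lt + 1);
        return Character.isLetter(next) || next == '/' || next == '!';
    }

    // Returns {start, end} offsets (end exclusive) into the ORIGINAL html for
    // every word outside a tag whose lower-cased form is in 'terms'.
    static List<int[]> findTermOffsets(String html, Set<String> terms) {
        List<int[]> hits = new ArrayList<>();
        boolean inTag = false;
        int i = 0;
        while (i < html.length()) {
            char c = html.charAt(i);
            if (inTag) {
                if (c == '>') inTag = false;   // tag content is never matched
                i++;
            } else if (c == '<' && startsTag(html, i)) {
                inTag = true;
                i++;
            } else if (Character.isLetterOrDigit(c)) {
                int start = i;
                while (i < html.length()
                        && Character.isLetterOrDigit(html.charAt(i))) {
                    i++;
                }
                String word = html.substring(start, i).toLowerCase(Locale.ROOT);
                if (terms.contains(word)) hits.add(new int[] {start, i});
            } else {
                i++;                           // punctuation, space, stray '<'
            }
        }
        return hits;
    }

    // 'terms' must already be lower-case. Returns the full document.
    static String highlight(String html, Set<String> terms) {
        StringBuilder out = new StringBuilder(html);
        List<int[]> hits = findTermOffsets(html, terms);
        // Insert from the end so earlier offsets stay valid.
        for (int k = hits.size() - 1; k >= 0; k--) {
            int[] h = hits.get(k);
            out.insert(h[1], "</span>");
            out.insert(h[0], "<span class=\"hit\">");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<p class=\"lucene\">Lucene is fast, if a < b.</p>";
        System.out.println(highlight(html, Set.of("lucene")));
    }
}
```

In the sample input the word inside the class attribute is left alone and only 
the occurrence in the text is wrapped, because offsets always refer to the 
original string rather than to the stripped text.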


On 10/23/2012 8:00 PM, Scott Smith wrote:

I need to take an HTML page that I retrieve from my Lucene search and 
highlight all of the terms that are part of the search.  I need to skip over 
any HTML tags, since I don't want words inside tags that happen to match the 
search to be highlighted.

Note that I don't want sections of the document.  I need to highlight all terms in the 
document (with a <span> or something similar) and get back the entire document (with 
the new <span>s) so it can be displayed in its entirety with the search terms 
highlighted.

Last time I did this (in the days of 1.4.2 - so a while ago), I had to write a 
custom tokenizer that skipped over the html tokens so that I didn't 
accidentally highlight them.  I'm hoping that there is an easier way to do this 
now.

Suggestions?


