Re: Are there any tokenizers that ignore HTML tags but keep the offsets so they can be used for highlighting in the original document?

Ahmet Arslan Mon, 07 Jun 2010 14:09:32 -0700

> I need to index HTML documents and one of the requirements is to highlight
> documents while maintaining all of the original formatting. The documents
> are relatively simple HTML, meaning no JavaScript code that changes elements
> at runtime or too fancy CSS styling.
> 
> I think it should be possible to write a tokenizer that strips out the HTML
> tags but maintains the original offsets within the HTML document so they
> can be used for highlighting the original HTML document, not just the
> text representation.
> 
> Does anybody know any tokenizers that can do this? It seems it's something
> other people may need too.
> 
> I am fairly new to Lucene so I may have chosen the wrong terminology but I
> hope this makes sense.


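You can use org.apache.solr.analysis.HTMLStripCharFilter. A CharFilter strips
the markup before the tokenizer sees the text and corrects the token offsets so
that they still point into the original HTML. You can add one or more
org.apache.lucene.analysis.CharFilter(s) before the tokenizer in your analyzer.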
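
To make the idea concrete, here is a rough, library-free sketch of what "strip the tags but keep the original offsets" means. It is not Lucene's or Solr's implementation and does not use their APIs: the class HtmlOffsetSketch and its methods tokenize and highlight are hypothetical names made up for this illustration. It assumes simple HTML only (no entities, comments, scripts, or '>' inside attribute values).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; a real setup would rely on HTMLStripCharFilter.
class HtmlOffsetSketch {

    /** A token together with its start/end offsets in the ORIGINAL HTML string. */
    static final class Token {
        final String text;
        final int start; // inclusive offset into the original HTML
        final int end;   // exclusive offset into the original HTML

        Token(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    /** Splits on non-alphanumeric characters, skipping everything between '<' and '>'. */
    static List<Token> tokenize(String html) {
        List<Token> tokens = new ArrayList<>();
        boolean inTag = false;
        int tokenStart = -1;
        for (int i = 0; i <= html.length(); i++) {
            // A trailing space acts as a sentinel so the last token gets flushed.
            char c = i < html.length() ? html.charAt(i) : ' ';
            if (inTag) {
                if (c == '>') {
                    inTag = false;
                }
                continue; // characters inside a tag never become tokens
            }
            if (Character.isLetterOrDigit(c)) {
                if (tokenStart < 0) {
                    tokenStart = i;
                }
            } else {
                if (tokenStart >= 0) {
                    tokens.add(new Token(html.substring(tokenStart, i), tokenStart, i));
                    tokenStart = -1;
                }
                if (c == '<') {
                    inTag = true;
                }
            }
        }
        return tokens;
    }

    /** Wraps every token equal to term (ignoring case) in <em>...</em> inside the original HTML. */
    static String highlight(String html, String term) {
        StringBuilder out = new StringBuilder();
        int last = 0;
        for (Token t : tokenize(html)) {
            if (t.text.equalsIgnoreCase(term)) {
                out.append(html, last, t.start)
                   .append("<em>")
                   .append(html, t.start, t.end)
                   .append("</em>");
                last = t.end;
            }
        }
        out.append(html, last, html.length());
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(highlight("<p>Hello <b>world</b></p>", "world"));
    }
}
```

Because each token carries offsets into the untouched HTML, the highlighter can insert its markers there and leave the rest of the formatting alone. A word interrupted by a tag (for example foo<b>bar</b>) comes out as two tokens in this sketch.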


      

