Are there any tokenizers that ignore HTML tags but keep the offsets so they can be used for highlighting in the original document?

Hans Merkl Mon, 07 Jun 2010 11:48:01 -0700

Hi,
I need to index HTML documents and one of the requirements is to highlight
documents while maintaining all of the original formatting. The documents
are relatively simple HTML, meaning no JavaScript code that changes elements
at runtime or too fancy CSS styling.


I think it should be possible to write a tokenizer that strips out the HTML
tags but maintains the original offsets within the HTML document so they
can be used for highlighting the original HTML document, not just the
text representation.

Does anybody know any tokenizers that can do this? It seems it's something
other people may need too.

I am fairly new to Lucene so I may have chosen the wrong terminology but I
hope this makes sense.

Hans

Are there any tokenizers that ignore HTML tags but keep the offsets so they can be used for highlighting in the original document?

Reply via email to