Hi,
I need to index HTML documents, and one of the requirements is to highlight
search hits in the documents while maintaining all of the original formatting.
The documents are relatively simple HTML: no JavaScript that changes elements
at runtime and no overly fancy CSS styling.

I think it should be possible to write a tokenizer that strips out the HTML
tags but maintains the original offsets within the HTML document so they
can be used for highlighting the original HTML document, not just the
text representation.
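
To make the idea concrete, here is a minimal sketch in plain Python (not
Lucene code). It assumes simple, well-formed HTML as described above. The
names tokenize_with_offsets and highlight are hypothetical, made up for this
illustration. The point it shows: tags are skipped, but every token keeps its
start/end position in the original HTML string, so highlighting can be applied
to the original markup. In Lucene the same positions would presumably be
reported through the token's offset attribute.

```python
import re

# Assumption: a tag is anything from "<" to the next ">", and a token is a run
# of word characters. This ignores entities (e.g. "&amp;"), comments, and
# <script>/<style> bodies, which a real implementation would need to handle.
_TAG_OR_WORD = re.compile(r"<[^>]*>|\w+")


def tokenize_with_offsets(html):
    """Return (term, start, end) tuples; offsets index into the original HTML."""
    tokens = []
    for m in _TAG_OR_WORD.finditer(html):
        text = m.group()
        if text.startswith("<"):
            continue  # a tag: emits no token, but its length still counts
        tokens.append((text.lower(), m.start(), m.end()))
    return tokens


def highlight(html, query_terms, pre="<em>", post="</em>"):
    """Wrap matching terms in pre/post markers inside the original HTML."""
    terms = {t.lower() for t in query_terms}
    out = []
    last = 0
    for term, start, end in tokenize_with_offsets(html):
        if term in terms:
            out.append(html[last:start])                # untouched markup
            out.append(pre + html[start:end] + post)    # highlighted term
            last = end
    out.append(html[last:])
    return "".join(out)


if __name__ == "__main__":
    doc = "<p>Hello <b>Lucene</b> world</p>"
    print(tokenize_with_offsets(doc))
    print(highlight(doc, ["lucene"]))
```

Because the offsets refer to the original string, the surrounding tags
(here the <b> element) are left exactly as they were.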

Does anybody know of any tokenizers that can do this? It seems like something
other people may need too.

I am fairly new to Lucene, so I may have chosen the wrong terminology, but I
hope this makes sense.

Hans
