Hi, I need to index HTML documents, and one of the requirements is to highlight matches in the documents while maintaining all of the original formatting. The documents are relatively simple HTML, meaning no JavaScript that changes elements at runtime and no overly fancy CSS styling.
I think it should be possible to write a tokenizer that strips out the HTML tags but keeps each token's offsets relative to the original HTML document, so that they can be used to highlight the original HTML rather than just its text representation. Does anybody know of any tokenizers that can do this? It seems like something other people may need too. I am fairly new to Lucene, so I may have chosen the wrong terminology, but I hope this makes sense.
Hans
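P.S. To make the idea a bit more concrete, here is a rough sketch in plain Java of what I mean by keeping the original offsets. It does not use any Lucene API; the names `OffsetPreservingHtmlTokenizer`, `Token` and `tokenize` are made up for illustration. It assumes simple, well-formed markup and ignores entities such as `&amp;`, comments, `<script>`/`<style>` bodies, and `>` characters inside attribute values.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not a Lucene class: splits the visible text of an
// HTML string into word tokens and records where each one sits in the
// ORIGINAL markup, so a highlighter could wrap those exact spans.
class OffsetPreservingHtmlTokenizer {

    // A token plus its [start, end) character offsets in the original HTML.
    static final class Token {
        final String text;
        final int start;
        final int end;

        Token(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    static List<Token> tokenize(String html) {
        List<Token> tokens = new ArrayList<>();
        boolean inTag = false;   // true while between '<' and '>'
        int tokenStart = -1;     // start offset of the current word, or -1

        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (inTag) {
                // Skip everything inside a tag; offsets keep counting.
                if (c == '>') {
                    inTag = false;
                }
                continue;
            }
            if (Character.isLetterOrDigit(c)) {
                if (tokenStart < 0) {
                    tokenStart = i;
                }
                continue;
            }
            // Any other character (including '<') ends the current word.
            if (tokenStart >= 0) {
                tokens.add(new Token(html.substring(tokenStart, i), tokenStart, i));
                tokenStart = -1;
            }
            if (c == '<') {
                inTag = true;
            }
        }
        // Flush a word that runs to the end of the input.
        if (tokenStart >= 0) {
            tokens.add(new Token(html.substring(tokenStart), tokenStart, html.length()));
        }
        return tokens;
    }
}
```

With something like this, `html.substring(token.start, token.end)` always gives back the token text, so a highlighter could insert its markup at those positions without touching the surrounding tags. One limitation of this sketch is that a word interrupted by an inline tag, such as `high<b>light</b>`, comes out as two separate tokens.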