> I need to index HTML documents and one of the requirements is to
> highlight documents while maintaining all of the original formatting.
> The documents are relatively simple HTML, meaning no JavaScript code
> that changes elements at runtime or too fancy CSS styling.
>
> I think it should be possible to write a tokenizer that strips out
> the HTML tags but maintains the original offsets within the HTML
> document so they can be used for highlighting the original HTML
> document, not just the text representation.
>
> Does anybody know any tokenizers that can do this? It seems it's
> something other people may need too.
>
> I am fairly new to Lucene so I may have chosen the wrong terminology
> but I hope this makes sense.

You can use org.apache.solr.analysis.HTMLStripCharFilter. An analyzer can apply one or more org.apache.lucene.analysis.CharFilter instances to the input before it reaches the tokenizer. A CharFilter keeps track of offset corrections, so the token offsets refer to positions in the original HTML rather than in the stripped text, which is what you need for highlighting.
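
To illustrate the idea of stripping tags while keeping original offsets, here is a minimal, library-free sketch in plain Java. It is not the Lucene or Solr implementation: the class HtmlOffsetSketch and its tokenize and highlight methods are hypothetical names made up for this example, and it assumes simple, well-formed markup (no entities, comments, or script blocks, and no '>' inside attribute values).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Lucene or Solr.
final class HtmlOffsetSketch {

    /** A word plus its start (inclusive) and end (exclusive) offsets in the ORIGINAL HTML. */
    static final class Token {
        final String text;
        final int start;
        final int end;

        Token(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    /** Splits the text content into letter/digit runs, skipping everything between '<' and '>'. */
    static List<Token> tokenize(String html) {
        List<Token> tokens = new ArrayList<>();
        boolean inTag = false;
        int tokenStart = -1;
        // Iterate one position past the end so a trailing word is flushed.
        for (int i = 0; i <= html.length(); i++) {
            char c = i < html.length() ? html.charAt(i) : ' ';
            if (inTag) {
                if (c == '>') {
                    inTag = false;
                }
                continue;
            }
            if (Character.isLetterOrDigit(c)) {
                if (tokenStart < 0) {
                    tokenStart = i;
                }
            } else {
                if (tokenStart >= 0) {
                    tokens.add(new Token(html.substring(tokenStart, i), tokenStart, i));
                    tokenStart = -1;
                }
                if (c == '<') {
                    inTag = true;
                }
            }
        }
        return tokens;
    }

    /** Wraps each occurrence of term in <em> tags, leaving all other markup untouched. */
    static String highlight(String html, String term) {
        StringBuilder out = new StringBuilder();
        int last = 0;
        for (Token t : tokenize(html)) {
            if (t.text.equalsIgnoreCase(term)) {
                out.append(html, last, t.start)
                   .append("<em>")
                   .append(html, t.start, t.end)
                   .append("</em>");
                last = t.end;
            }
        }
        out.append(html, last, html.length());
        return out.toString();
    }
}
```

Because the offsets point into the original document, highlight("<p>Hello <b>world</b></p>", "world") inserts the markers around the word and leaves the existing tags where they were. In a real setup the char filter and tokenizer would do the offset bookkeeping for you; the sketch only shows why that bookkeeping makes highlighting of the original HTML possible.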