Hi Uwe,

thanks for the hint. I'm not sure how much of the Solr functionality I need to implement in order to use the HTMLStripCharFilter. I'm using Apache Tika for HTML parsing, and I use the StandardAnalyzer to initialize my IndexWriter. I don't use a Tokenizer myself; would that be the Solr approach?
At this point I'm not sure how to use Solr within my application, where I already use Lucene. Can I use just this one class, or a few classes, from the Solr core while indexing with the Lucene IndexWriter? Or do I need to switch my indexing and searching to the Solr way just to get what I need (highlighting of the hits within HTML files)?

Thank you so much for your help :-)

Karo

On Mon, Jan 24, 2011 at 2:03 PM, Karolina Bernat <karolina.ber...@googlemail.com> wrote:

> Hi all,
>
> I'm new to Lucene and have a question about indexing/highlighting of HTML
> files with Lucene.
>
> What I need to do is highlight the hits (terms) in the original HTML file
> (or get the positions of the terms/tokens in the original file).
> This problem was already described by Fred Toth in this thread in 2005
> (Preserving original HTML file offsets for highlighting, need
> HTMLTokenizer?):
>
> http://mail-archives.apache.org/mod_mbox/lucene-java-user/200505.mbox/%3c6.2.1.2.2.20050530134630.063ae...@fast.synernet.com%3E
>
> I've searched the mailing list archives hoping for an answer, but had no
> luck.
>
> Does anyone know whether there is a solution for this problem? If you
> know that it's not possible with Lucene to highlight the hits in the
> original HTML file, that would be helpful to know too (I could stop
> looking for it...).
>
> Many thanks in advance!
> Karo
>
> P.S. Actually I wanted to answer the original thread/question from 2005.
> Is there a way to do this? How can I post an answer to an old thread/mail
> from the mailing list?