Hi Uwe,

thank you so much for your help. It worked like a dream! :-)

I made a custom analyzer class that extends StandardAnalyzer. Then I
needed to override the tokenStream method like this:

    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Wrap the reader so that the HTML markup is stripped before tokenizing
        CharStream chStream = CharReader.get(reader);
        HTMLStripCharFilter filter = new HTMLStripCharFilter(chStream);
        // Let StandardAnalyzer build its usual token stream on the filtered input
        return super.tokenStream(fieldName, filter);
    }

In the constructor I call the super constructor. That worked really
well, and it was the only place where I needed to make changes.

Thanks once again!

Best regards from Hamburg,
Karo

On Tue, Jan 25, 2011 at 2:15 PM, Uwe Schindler <u...@thetaphi.de> wrote:

> Hi Karolina,
>
> for this no Solr is needed at all. The CharFilter simply lives outside
> Lucene, but you can use it without anything else from Solr. You can copy
> the Java file from Solr's source, choose another package name, and you
> are finished.
>
> About Tokenizer and Analyzer: StandardAnalyzer does the combination of
> Tokenizers and TokenFilters (and possibly CharFilters). It is just an
> easy-to-use class that serves as a factory for TokenStreams (which is the
> superclass of Tokenizers). If you want your own analysis, you have to
> implement an Analyzer class (possibly using the StandardAnalyzer source
> code as a basis) and add the needed filters (this HTMLStripCharFilter) to
> the factory method.
>
> You may read the analysis package's javadocs for information on how to do
> this. (Note: this HTMLStripCharFilter does not need TIKA at all.)
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
> > -----Original Message-----
> > From: Karolina Bernat [mailto:karolina.ber...@googlemail.com]
> > Sent: Tuesday, January 25, 2011 1:45 PM
> > To: java-user@lucene.apache.org
> > Subject: Re: Preserving original HTML file offsets for highlighting
> >
> > Hi Uwe,
> >
> > thanks for this hint. I'm not sure how much of the Solr functionality
> > I need to implement in order to use the HTMLStripCharFilter. I'm using
> > Apache Tika for HTML parsing. Furthermore, I use the StandardAnalyzer to
> > initialize my IndexWriter. I don't use a Tokenizer - this would be the
> > Solr approach?
> >
> > At this point, I'm not sure how to use Solr within my application,
> > where I already use Lucene. Can I use, e.g., just this one class or a
> > few classes from the Solr core while indexing with the Lucene
> > IndexWriter? Or do I need to switch my indexing and searching to the
> > Solr way, just to get what I need (highlighting of the hits within HTML
> > files)?
> >
> > Thank you so much for your help :-)
> > Karo
> >
> > On Mon, Jan 24, 2011 at 2:03 PM, Karolina Bernat <
> > karolina.ber...@googlemail.com> wrote:
> >
> > > Hi all,
> > >
> > > I'm new to Lucene and have a question about indexing/highlighting of
> > > HTML files with Lucene.
> > >
> > > What I need to do is highlight the hits (terms) in the original HTML
> > > file (or get the positions of the terms/tokens in the original file).
> > > This problem has already been described by Fred Toth in this thread in
> > > 2005 (Preserving original HTML file offsets for highlighting, need
> > > HTMLTokenizer?):
> > >
> > > http://mail-archives.apache.org/mod_mbox/lucene-java-user/200505.mbox/%3c6.2.1.2.2.20050530134630.063ae...@fast.synernet.com%3E
> > >
> > > I've searched the mailing list archives hoping for an answer, but I
> > > had no luck.
> > >
> > > Does anyone know whether there is a solution for this problem?
> > > If you know that it's not possible with Lucene to highlight the
> > > hits in the original HTML file, that would be helpful to know too (I
> > > could stop looking for it...).
> > >
> > > Many thanks in advance!
> > > Karo
> > >
> > > P.S. Actually I wanted to answer the original thread/question from
> > > 2005 - is there a way to do this? How can I post an answer to an old
> > > thread/mail from the mailing list?
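
For readers of this thread: the idea behind stripping the markup while still highlighting in the original HTML file can be illustrated with a small sketch that does not depend on Lucene. It is a simplified illustration only, not Solr's HTMLStripCharFilter. The class HtmlOffsetSketch and its methods strip and highlightFirst are hypothetical names, and the sketch assumes simple tags without entities, comments, scripts, or '>' characters inside attribute values.

```java
import java.util.Arrays;

/** Simplified illustration only; this is NOT Solr's HTMLStripCharFilter. */
final class HtmlOffsetSketch {

    /** Stripped text plus, for each stripped character, its offset in the original HTML. */
    static final class Stripped {
        final String text;
        final int[] originalOffsets;

        Stripped(String text, int[] originalOffsets) {
            this.text = text;
            this.originalOffsets = originalOffsets;
        }
    }

    /** Drops everything between '<' and '>' and records where each kept character came from. */
    static Stripped strip(String html) {
        StringBuilder text = new StringBuilder();
        int[] offsets = new int[html.length()];
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;
            } else if (c == '>' && inTag) {
                inTag = false;
            } else if (!inTag) {
                offsets[text.length()] = i; // stripped position -> original position
                text.append(c);
            }
        }
        return new Stripped(text.toString(), Arrays.copyOf(offsets, text.length()));
    }

    /** Finds the term in the stripped text and wraps the matching span of the ORIGINAL HTML in <mark>. */
    static String highlightFirst(String html, String term) {
        if (term.isEmpty()) {
            return html;
        }
        Stripped s = strip(html);
        int start = s.text.indexOf(term);
        if (start < 0) {
            return html; // no hit: leave the document untouched
        }
        int origStart = s.originalOffsets[start];
        int origEnd = s.originalOffsets[start + term.length() - 1] + 1;
        return html.substring(0, origStart)
                + "<mark>" + html.substring(origStart, origEnd) + "</mark>"
                + html.substring(origEnd);
    }

    public static void main(String[] args) {
        // prints <p>Hello <b><mark>Lucene</mark></b> world</p>
        System.out.println(highlightFirst("<p>Hello <b>Lucene</b> world</p>", "Lucene"));
    }
}
```

The search runs on the stripped text, but the highlight is applied at the mapped positions in the original file. A hit that spans a tag boundary would produce badly nested markup in this sketch, so a real application would need to handle that case separately.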