Hi Karolina,

you do not need Solr at all for this. The CharFilter simply lives outside Lucene, in Solr's source, but you can use it without anything else from Solr. Copy the Java file from Solr's source, choose another package name, and you are finished.
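To illustrate why a char filter helps with highlighting in the original file, here is a minimal, self-contained sketch of the idea: strip the markup before tokenizing, but remember where each kept character came from. This is not the Solr class and it uses no Lucene API. The names HtmlOffsetSketch, Stripped, strip, originalStart and originalEnd are made up for illustration, and the tag handling is deliberately naive (no entities, comments or scripts).

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: NOT the Solr HTMLStripCharFilter, and no Lucene API is used.
final class HtmlOffsetSketch {

    // Stripped text plus, for each kept character, its index in the original HTML.
    static final class Stripped {
        final String text;
        final int[] originalIndex;

        Stripped(String text, int[] originalIndex) {
            this.text = text;
            this.originalIndex = originalIndex;
        }

        // Original offset of a token that starts at strippedStart (inclusive).
        int originalStart(int strippedStart) {
            return originalIndex[strippedStart];
        }

        // Original offset just past a token that ends at strippedEnd (exclusive).
        int originalEnd(int strippedEnd) {
            return originalIndex[strippedEnd - 1] + 1;
        }
    }

    // Naive tag removal: drops everything from '<' up to the next '>'.
    static Stripped strip(String html) {
        StringBuilder text = new StringBuilder();
        List<Integer> kept = new ArrayList<>();
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;               // tag starts, skip until it closes
            } else if (c == '>' && inTag) {
                inTag = false;              // tag ends, the '>' itself is dropped
            } else if (!inTag) {
                text.append(c);             // visible character: keep it
                kept.add(i);                // and record its original position
            }
        }
        int[] originalIndex = new int[kept.size()];
        for (int i = 0; i < originalIndex.length; i++) {
            originalIndex[i] = kept.get(i);
        }
        return new Stripped(text.toString(), originalIndex);
    }
}
```

For the input `<p>Hello <b>Lucene</b></p>` the stripped text is `Hello Lucene`. The token `Lucene` sits at offsets 6 to 12 in the stripped text, and the sketch maps that back to offsets 12 to 18 in the original HTML, which is the span a highlighter would have to mark.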
About Tokenizer and Analyzer: StandardAnalyzer combines Tokenizers and TokenFilters (and possibly CharFilters). It is just an easy-to-use class that serves as a factory for TokenStreams (TokenStream is the superclass of Tokenizer). If you want your own analysis, you have to implement an Analyzer class (possibly using the StandardAnalyzer source code as a basis) and add the needed filters (here, the HTMLStripCharFilter) in its factory method. The javadocs of the analysis package explain how to do this.

Note: The HTMLStripCharFilter does not need Tika at all.

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Karolina Bernat [mailto:karolina.ber...@googlemail.com]
> Sent: Tuesday, January 25, 2011 1:45 PM
> To: java-user@lucene.apache.org
> Subject: Re: Preserving original HTML file offsets for highlighting
>
> Hi Uwe,
>
> thanks for this hint. I'm not sure how much of the Solr functionality I
> need to implement in order to use the HTMLStripCharFilter. I'm using
> Apache Tika for HTML parsing. Furthermore, I use the StandardAnalyzer to
> initialize my IndexWriter. I don't use a Tokenizer - would this be the
> Solr approach?
>
> At this point, I'm not sure how to use Solr within my application, where
> I already use Lucene. Can I use, e.g., just this one class or a few
> classes from the Solr core while indexing with the Lucene IndexWriter?
> Or do I need to switch my indexing and searching to the Solr way, just
> to get what I need (highlighting of the hits within HTML files)?
>
> Thank you so much for your help :-)
> Karo
>
> On Mon, Jan 24, 2011 at 2:03 PM, Karolina Bernat
> <karolina.ber...@googlemail.com> wrote:
>
> > Hi all,
> >
> > I'm new to Lucene and have a question about indexing/highlighting of
> > HTML files with Lucene.
> >
> > What I need to do is highlight the hits (terms) in the original HTML
> > file (or get the positions of the terms/tokens in the original file).
> > This problem has already been described by Fred Toth in this thread in
> > 2005 (Preserving original HTML file offsets for highlighting, need
> > HTMLTokenizer?):
> >
> > http://mail-archives.apache.org/mod_mbox/lucene-java-user/200505.mbox/%3c6.2.1.2.2.20050530134630.063ae...@fast.synernet.com%3E
> >
> > I've searched the mailing list archives hoping for an answer, but I
> > had no luck.
> >
> > Does anyone have an idea whether there is a solution for this problem?
> > Also, if you know that it's not possible with Lucene to highlight the
> > hits in the original HTML file, it would be helpful to know (I could
> > stop looking for it...).
> >
> > Many thanks in advance!
> > Karo
> >
> > P.S. Actually I wanted to answer the original thread/question from 2005
> > - is there a way to do this? How can I post an answer to an old
> > thread/mail from the mailing list?