Re: Preserving original HTML file offsets for highlighting

Karolina Bernat Wed, 26 Jan 2011 01:53:29 -0800

Hi Uwe,

thank you so much for your help, it worked like a dream!:-)


I made a custom analyzer class that extends StandardAnalyzer.
Then I only needed to override the tokenStream method like this:

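    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Wrap the Reader as a CharStream so a CharFilter can sit in front of it.
        CharStream chStream = CharReader.get(reader);
        // Strip the HTML markup; the filter keeps track of the original offsets.
        HTMLStripCharFilter filter = new HTMLStripCharFilter(chStream);
        // Let StandardAnalyzer build its usual chain on top of the filtered input.
        return super.tokenStream(fieldName, filter);
    }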

and in the constructor I called the super constructor.
That worked really well, and it was the only place where I needed to make
changes.
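
For anyone wondering why this keeps the highlighting offsets right: as far as
I understand it, the char filter removes the markup but remembers where each
kept character sat in the original input, so token offsets can be corrected
back to positions in the HTML file. Below is a minimal plain-Java sketch of
that idea. It uses no Lucene or Solr classes; the names OffsetMappingSketch,
Stripped, strip and originalSpan are made up for illustration, and the naive
"drop everything between < and >" handling is an assumption, not how
HTMLStripCharFilter is actually implemented.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: strips <...> tags and remembers original offsets. */
public class OffsetMappingSketch {

    /** Stripped text plus, per kept character, its offset in the original input. */
    public static final class Stripped {
        public final String text;
        private final int[] originalOffsets;

        Stripped(String text, int[] originalOffsets) {
            this.text = text;
            this.originalOffsets = originalOffsets;
        }

        /**
         * Maps a non-empty [start, end) span in the stripped text back to a
         * [start, end) span in the original input.
         */
        public int[] originalSpan(int start, int end) {
            int origStart = originalOffsets[start];
            // end is exclusive, so map the last character and step one past it.
            int origEnd = originalOffsets[end - 1] + 1;
            return new int[] { origStart, origEnd };
        }
    }

    /** Naive tag stripping: drops everything from '<' up to the next '>'. */
    public static Stripped strip(String html) {
        StringBuilder out = new StringBuilder();
        List<Integer> offsets = new ArrayList<>();
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;
            } else if (c == '>' && inTag) {
                inTag = false;
            } else if (!inTag) {
                out.append(c);
                offsets.add(i); // remember where this character came from
            }
        }
        int[] map = new int[offsets.size()];
        for (int i = 0; i < map.length; i++) {
            map[i] = offsets.get(i);
        }
        return new Stripped(out.toString(), map);
    }
}
```

With the input <p>Hello <b>world</b></p> the stripped text is "Hello world",
and the span of "world" in the stripped text maps back to offsets 12 to 17 in
the original string. In the real setup this bookkeeping is done by the char
filter and the tokenizer, so the offsets stored with each token already refer
to the original HTML.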

Thanks once again!

Best regards from Hamburg,
Karo


On Tue, Jan 25, 2011 at 2:15 PM, Uwe Schindler <u...@thetaphi.de> wrote:

> Hi Karolina,
>
> you don't need Solr for this at all. The CharFilter simply lives outside
> Lucene, but you can use it without anything else from Solr. You can copy the
> Java file from Solr's source, choose another package name, and you are
> finished.
>
> About Tokenizer and Analyzer: StandardAnalyzer does the combination of
> Tokenizers and TokenFilters (and possibly CharFilters). It is just an
> easy-to-use class that serves as a factory for TokenStreams (which is the
> superclass of Tokenizers). If you want your own analysis, you have to
> implement an Analyzer class (possibly use StandardAnalyzer source code as
> basis) and add the needed Filters (this HTMLStripCharFilter) to the factory
> method.
>
> You may read the analysis package's javadocs for information on how to do
> this. Note: the HTMLStripCharFilter does not need Tika at all.
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>
> > -----Original Message-----
> > From: Karolina Bernat [mailto:karolina.ber...@googlemail.com]
> > Sent: Tuesday, January 25, 2011 1:45 PM
> > To: java-user@lucene.apache.org
> > Subject: Re: Preserving original HTML file offsets for highlighting
> >
> > Hi Uwe,
> >
> > thanks for this hint. I'm not sure how much of the Solr functionality I
> > need to implement in order to use the HTMLStripCharFilter. I'm using Apache
> > Tika for HTML parsing. Furthermore, I use the StandardAnalyzer to initialize
> > my IndexWriter. I don't use a Tokenizer; would that be the Solr approach?
> >
> > At this point, I'm not sure how to use Solr within my application, where I
> > already use Lucene. Can I use, e.g., just this one or a few classes from the
> > Solr core while indexing with the Lucene IndexWriter? Or do I need to switch
> > my indexing and searching to the Solr way just to get what I need
> > (highlighting of the hits within HTML files)?
> >
> > Thank you so much for your help:-)
> > Karo
> >
> >
> >
> > On Mon, Jan 24, 2011 at 2:03 PM, Karolina Bernat <
> > karolina.ber...@googlemail.com> wrote:
> >
> > > Hi all,
> > >
> > > I'm new to Lucene and have a question about indexing/highlighting of
> > > HTML files with Lucene.
> > >
> > > What I need to do is highlight the hits (terms) in the original HTML
> > > file (or get the positions of the terms/tokens in the original file).
> > > This problem has already been described by Fred Toth in this thread in
> > > 2005 (Preserving original HTML file offsets for highlighting, need
> > > HTMLTokenizer?):
> > >
> > >
> > > http://mail-archives.apache.org/mod_mbox/lucene-java-user/200505.mbox/%3c6.2.1.2.2.20050530134630.063ae...@fast.synernet.com%3E
> > >
> > > I've searched the mailing list archives hoping for an answer, but I
> > > had no luck.
> > >
> > > Does anyone know whether there is a solution for this problem?
> > > Also, if you know that it's not possible with Lucene to highlight the
> > > hits in the original HTML file, that would be helpful to know (I could
> > > stop looking for it...).
> > >
> > > Many thanks in advance!
> > > Karo
> > >
> > > P.S. Actually I wanted to answer the original thread/question from 2005
> > > - is there a way to do this? How can I post an answer to an old
> > > thread/mail from the mailing list?
> > >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>
