Have a look at PerFieldAnalyzerWrapper:
http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/PerFieldAnalyzerWrapper.html
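As a minimal sketch of how the wrapper is typically wired in (this assumes the Lucene 2.0-era API; the "contents" field name stands in for your DocFormatters.CONTENT_FIELD constant, and the RAMDirectory is only for illustration):

import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.RAMDirectory;

public class PerFieldDemo {
    public static void main(String[] args) throws Exception {
        // Fall back to StandardAnalyzer for every field, but tokenize
        // the content field on whitespace only, so tokens like
        // 310N-P-Q stay in one piece.
        PerFieldAnalyzerWrapper wrapper =
            new PerFieldAnalyzerWrapper(new StandardAnalyzer());
        wrapper.addAnalyzer("contents", new WhitespaceAnalyzer());

        // Pass the wrapper wherever an Analyzer is expected;
        // no changes to the standard jar are needed.
        IndexWriter writer =
            new IndexWriter(new RAMDirectory(), wrapper, true);
        Document doc = new Document();
        doc.add(new Field("contents", "see document 310N-P-Q",
                          Field.Store.YES, Field.Index.TOKENIZED));
        writer.addDocument(doc);
        writer.close();
    }
}

The wrapper falls back to the default analyzer for every field you don't register, so only the jargon-heavy field changes behavior.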
Quoting Bill Taylor <[EMAIL PROTECTED]>:

> Is there some way to get the standard Field constructor to use, say,
> the WhitespaceTokenizer as opposed to the standard tokenizer?
>
> On Aug 29, 2006, at 10:50 AM, Krovi, DVSR_Sarma wrote:
>
>>> I suspect that my issue is getting the Field constructor to use a
>>> different tokenizer. Can anyone help?
>>
>> You basically need to come up with your own Tokenizer (you can always
>> write a corresponding JavaCC grammar; compiling it will give you the
>> Tokenizer). Then extend the org.apache.lucene.analysis.Analyzer class
>> and override its tokenStream() method. Wherever you index or search,
>> use an instance of this custom analyzer.
>>
>> public class MyAnalyzer extends Analyzer
>> {
>>     public TokenStream tokenStream(String fieldName, Reader reader)
>>     {
>>         TokenStream ts = new MyTokenizer(reader);
>>         /* pass this token stream through whatever other
>>            filters you are interested in */
>>         return ts;
>>     }
>> }
>>
>> Krovi.
>>
>> -----Original Message-----
>> From: Bill Taylor [mailto:[EMAIL PROTECTED]
>> Sent: Tuesday, August 29, 2006 8:10 PM
>> To: java-user@lucene.apache.org
>> Subject: Installing a custom tokenizer
>>
>> I am indexing documents which are filled with government jargon. As
>> one would expect, the standard tokenizer has problems with
>> governmentese.
>>
>> In particular, the documents use words such as 310N-P-Q as references
>> to other documents. The standard tokenizer breaks this "word" at the
>> dashes, so I can find P or Q but not the entire token.
>>
>> I know how to write a new tokenizer. I would like hints on how to
>> install it and get my indexing system to use it; I don't want to
>> modify the standard .jar file. What I think I want to do is set up my
>> indexing operation to use the WhitespaceTokenizer instead of the
>> normal one, but I am unsure how to do this.
>>
>> I know that IndexTask has a setAnalyzer method. The document formats
>> are rather complicated, and I need special code to isolate the text
>> strings which should be indexed. My file analyzer isolates the string
>> I want to index, then does
>>
>> doc.add(new Field(DocFormatters.CONTENT_FIELD, <string from the file>,
>>                   Field.Store.YES, Field.Index.TOKENIZED));
>>
>> I suspect that my issue is getting the Field constructor to use a
>> different tokenizer. Can anyone help?
>>
>> Thanks.
>>
>> Bill Taylor
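And in case you prefer Krovi's route, here is a complete, compilable version of that sketch, again assuming the Lucene 2.0-era Analyzer API; the whitespace tokenizer and the LowerCaseFilter are only examples of a filter chain, not a prescription:

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

// Tokenizes on whitespace only, so references such as 310N-P-Q
// survive as single tokens; LowerCaseFilter is just one example
// of chaining further filters onto the stream.
public class MyAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream ts = new WhitespaceTokenizer(reader);
        ts = new LowerCaseFilter(ts);
        return ts;
    }
}

To "install" it without touching the standard jar, just pass an instance wherever an Analyzer is accepted, e.g. new IndexWriter(dir, new MyAnalyzer(), true), or through the setAnalyzer method you mention on IndexTask. The Field constructor never chooses the tokenizer; the analyzer handed to the writer does.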