Is there some way to get the standard Field constructor to use, say,
the WhitespaceTokenizer as opposed to the standard tokenizer?
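(For context: the Field constructor never takes a tokenizer or analyzer;
the analyzer is supplied when the IndexWriter is opened. Below is a
minimal sketch of doing that with the stock WhitespaceAnalyzer, assuming
the 2.x-era API; WhitespaceIndexing and "index-dir" are placeholder names.)

import java.io.IOException;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class WhitespaceIndexing
{
    public static void main(String[] args) throws IOException
    {
        // The analyzer is chosen per IndexWriter, not per Field: every
        // TOKENIZED field added through this writer is split on
        // whitespace instead of by the standard tokenizer.
        IndexWriter writer = new IndexWriter("index-dir", new WhitespaceAnalyzer(), true);
        // ... build Documents and call writer.addDocument(doc) as before ...
        writer.close();
    }
}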
On Aug 29, 2006, at 10:50 AM, Krovi, DVSR_Sarma wrote:
I suspect that my issue is getting the Field constructor to use a
different tokenizer. Can anyone help?
You basically need to come up with your own Tokenizer (you can always
write a corresponding JavaCC grammar and compile it to get the
Tokenizer).
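(A JavaCC grammar is not strictly required. A simpler sketch, assuming
the 2.x-era CharTokenizer base class: a hypothetical MyTokenizer that
treats letters, digits and dashes as token characters, so a reference
like 310N-P-Q stays a single token.)

import java.io.Reader;
import org.apache.lucene.analysis.CharTokenizer;

public class MyTokenizer extends CharTokenizer
{
    public MyTokenizer(Reader input)
    {
        super(input);
    }

    // A character belongs to a token if it is a letter, a digit or a dash;
    // anything else (whitespace, punctuation) ends the current token.
    protected boolean isTokenChar(char c)
    {
        return Character.isLetterOrDigit(c) || c == '-';
    }
}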
Then you need to extend the org.apache.lucene.analysis.Analyzer class
and override its tokenStream() method. Wherever you index or search,
use an instance of this custom Analyzer.
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;

public class MyAnalyzer extends Analyzer
{
    public TokenStream tokenStream(String fieldName, Reader reader)
    {
        TokenStream ts = new MyTokenizer(reader);
        /* Pass this TokenStream through any other filters you are
           interested in, e.g. ts = new LowerCaseFilter(ts); */
        return ts;
    }
}
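(To hook the analyzer up, a hedged sketch assuming the 2.x-era
IndexWriter and QueryParser; MyAnalyzerUsage, "index-dir" and the
"contents" field name are placeholders. The important point is to use
the same analyzer at index time and at query time.)

import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class MyAnalyzerUsage
{
    public static void main(String[] args) throws Exception
    {
        // Indexing: every tokenized field added through this writer
        // goes through MyAnalyzer (and hence its tokenizer).
        IndexWriter writer = new IndexWriter("index-dir", new MyAnalyzer(), true);
        // ... writer.addDocument(doc) as usual ...
        writer.close();

        // Searching: parse queries with the same analyzer so that terms
        // such as 310N-P-Q are tokenized identically at query time.
        QueryParser parser = new QueryParser("contents", new MyAnalyzer());
        Query query = parser.parse("310N-P-Q");
    }
}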
Krovi.
-----Original Message-----
From: Bill Taylor [mailto:[EMAIL PROTECTED]
Sent: Tuesday, August 29, 2006 8:10 PM
To: java-user@lucene.apache.org
Subject: Installing a custom tokenizer
I am indexing documents that are filled with government jargon. As
one would expect, the standard tokenizer has problems with
governmentese.
In particular, the documents use words such as 310N-P-Q as references
to other documents. The standard tokenizer breaks this "word" at the
dashes so that I can find P or Q but not the entire token.
I know how to write a new tokenizer. I would like hints on how to
install it and get my indexing system to use it. I don't want to
modify the standard .jar file. What I think I want to do is set up my
indexing operation to use the WhitespaceTokenizer instead of the normal
one, but I am unsure how to do this.
I know that the IndexTask has a setAnalyzer method. The document
formats are rather complicated and I need special code to isolate the
text strings which should be indexed. My file analyzer isolates the
string I want to index, then does
doc.add(new Field(DocFormatters.CONTENT_FIELD, <string from the file>,
Field.Store.YES, Field.Index.TOKENIZED));
I suspect that my issue is getting the Field constructor to use a
different tokenizer. Can anyone help?
Thanks.
Bill Taylor
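(One possibility, sketched under the assumption that only CONTENT_FIELD
needs the different tokenization: keep StandardAnalyzer as the default
and override just that one field with the 2.x-era
PerFieldAnalyzerWrapper. PerFieldSetup and "index-dir" are placeholder
names; DocFormatters.CONTENT_FIELD is the field from the snippet above.)

import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class PerFieldSetup
{
    public static void main(String[] args) throws Exception
    {
        // StandardAnalyzer for every field except CONTENT_FIELD, which is
        // split on whitespace so that 310N-P-Q survives as one token.
        PerFieldAnalyzerWrapper analyzer =
            new PerFieldAnalyzerWrapper(new StandardAnalyzer());
        analyzer.addAnalyzer(DocFormatters.CONTENT_FIELD, new WhitespaceAnalyzer());

        IndexWriter writer = new IndexWriter("index-dir", analyzer, true);
        // ... doc.add(new Field(...)) and writer.addDocument(doc) as before ...
        writer.close();
    }
}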