Have a look at PerFieldAnalyzerWrapper:
http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/PerFieldAnalyzerWrapper.html
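As a minimal sketch of how the wrapper is typically wired in (this assumes the Lucene 2.0-era API; the "contents" field name stands in for your DocFormatters.CONTENT_FIELD constant, and the RAMDirectory is only for illustration):

import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.RAMDirectory;

public class PerFieldDemo {
    public static void main(String[] args) throws Exception {
        // Fall back to StandardAnalyzer for every field, but tokenize
        // the content field on whitespace only, so tokens like
        // 310N-P-Q stay in one piece.
        PerFieldAnalyzerWrapper wrapper =
            new PerFieldAnalyzerWrapper(new StandardAnalyzer());
        wrapper.addAnalyzer("contents", new WhitespaceAnalyzer());

        // Pass the wrapper wherever an Analyzer is expected;
        // no changes to the standard jar are needed.
        IndexWriter writer =
            new IndexWriter(new RAMDirectory(), wrapper, true);
        Document doc = new Document();
        doc.add(new Field("contents", "see document 310N-P-Q",
                          Field.Store.YES, Field.Index.TOKENIZED));
        writer.addDocument(doc);
        writer.close();
    }
}

The wrapper falls back to the default analyzer for every field you don't register, so only the jargon-heavy field changes behavior.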
Quoting Bill Taylor <[EMAIL PROTECTED]>:

> Is there some way to get the standard Field constructor to use, say,
> the WhitespaceTokenizer as opposed to the standard tokenizer?
>
> On Aug 29, 2006, at 10:50 AM, Krovi, DVSR_Sarma wrote:
>
>>> I suspect that my issue is getting the Field constructor to use a
>>> different tokenizer. Can anyone help?
>>
>> You basically need to come up with your own Tokenizer (you can always
>> write a corresponding JavaCC grammar; compiling it will give you the
>> Tokenizer). Then extend the org.apache.lucene.analysis.Analyzer class
>> and override its tokenStream() method. Wherever you index or search,
>> use an instance of this custom analyzer.
>>
>> public class MyAnalyzer extends Analyzer
>> {
>>     public TokenStream tokenStream(String fieldName, Reader reader)
>>     {
>>         TokenStream ts = new MyTokenizer(reader);
>>         /* pass this token stream through whatever other
>>            filters you are interested in */
>>         return ts;
>>     }
>> }
>>
>> Krovi.
>>
>> -----Original Message-----
>> From: Bill Taylor [mailto:[EMAIL PROTECTED]
>> Sent: Tuesday, August 29, 2006 8:10 PM
>> To: java-user@lucene.apache.org
>> Subject: Installing a custom tokenizer
>>
>> I am indexing documents which are filled with government jargon. As
>> one would expect, the standard tokenizer has problems with
>> governmentese.
>>
>> In particular, the documents use words such as 310N-P-Q as references
>> to other documents. The standard tokenizer breaks this "word" at the
>> dashes, so I can find P or Q but not the entire token.
>>
>> I know how to write a new tokenizer. I would like hints on how to
>> install it and get my indexing system to use it; I don't want to
>> modify the standard .jar file. What I think I want to do is set up my
>> indexing operation to use the WhitespaceTokenizer instead of the
>> normal one, but I am unsure how to do this.
>>
>> I know that IndexTask has a setAnalyzer method. The document formats
>> are rather complicated, and I need special code to isolate the text
>> strings which should be indexed. My file analyzer isolates the string
>> I want to index, then does
>>
>> doc.add(new Field(DocFormatters.CONTENT_FIELD, <string from the file>,
>>                   Field.Store.YES, Field.Index.TOKENIZED));
>>
>> I suspect that my issue is getting the Field constructor to use a
>> different tokenizer. Can anyone help?
>>
>> Thanks.
>>
>> Bill Taylor
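And in case you prefer Krovi's route, here is a complete, compilable version of that sketch, again assuming the Lucene 2.0-era Analyzer API; the whitespace tokenizer and the LowerCaseFilter are only examples of a filter chain, not a prescription:

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

// Tokenizes on whitespace only, so references such as 310N-P-Q
// survive as single tokens; LowerCaseFilter is just one example
// of chaining further filters onto the stream.
public class MyAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream ts = new WhitespaceTokenizer(reader);
        ts = new LowerCaseFilter(ts);
        return ts;
    }
}

To "install" it without touching the standard jar, just pass an instance wherever an Analyzer is accepted, e.g. new IndexWriter(dir, new MyAnalyzer(), true), or through the setAnalyzer method you mention on IndexTask. The Field constructor never chooses the tokenizer; the analyzer handed to the writer does.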