Installing a custom tokenizer

Bill Taylor Tue, 29 Aug 2006 07:07:11 -0700

I am indexing documents which are filled with government jargon. As one would expect, the standard tokenizer has problems with governmentese.

In particular, the documents use words such as 310N-P-Q as references to other documents. The standard tokenizer breaks this "word" at the dashes so that I can find P or Q but not the entire token.
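
To make the difference concrete, here is a minimal sketch in plain Java. It is not Lucene code and does not reproduce the standard tokenizer's real grammar; the class and method names are hypothetical, and the dash-splitting rule is only an assumption that mimics the behavior described above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these are Lucene classes.
class TokenSplitDemo {

    // Assumed approximation of the problem: break on whitespace and on dashes.
    static List<String> splitOnDashesAndWhitespace(String text) {
        return nonEmpty(text.split("[\\s-]+"));
    }

    // Whitespace-only splitting keeps a reference like 310N-P-Q in one piece.
    static List<String> splitOnWhitespace(String text) {
        return nonEmpty(text.split("\\s+"));
    }

    // Drop the empty strings that String.split can produce at the start.
    static List<String> nonEmpty(String[] parts) {
        List<String> out = new ArrayList<>();
        for (String p : parts) {
            if (!p.isEmpty()) {
                out.add(p);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "See 310N-P-Q for details";
        System.out.println(splitOnDashesAndWhitespace(text)); // [See, 310N, P, Q, for, details]
        System.out.println(splitOnWhitespace(text));          // [See, 310N-P-Q, for, details]
    }
}
```

With whitespace-only splitting, a search for the whole reference 310N-P-Q has a single token to match, which is the behavior I am after.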

I know how to write a new tokenizer. I would like hints on how to install it and get my indexing system to use it. I don't want to modify the standard .jar file. What I think I want to do is set up my indexing operation to use the WhitespaceTokenizer instead of the normal one, but I am unsure how to do this.

I know that the IndexTask has a setAnalyzer method. The document formats are rather complicated and I need special code to isolate the text strings which should be indexed. My file analyzer isolates the string I want to index, then does

doc.add(new Field(DocFormatters.CONTENT_FIELD, <string from the file>, Field.Store.YES, Field.Index.TOKENIZED));

I suspect that my issue is getting the Field constructor to use a different tokenizer. Can anyone help?


Thanks.

Bill Taylor


