I am indexing documents which are filled with government jargon. As
one would expect, the standard tokenizer has problems with
governmentese.
In particular, the documents use words such as 310N-P-Q as references
to other documents. The standard tokenizer breaks this "word" at the
dashes so that I can find P or Q but not the entire token.
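
To illustrate the difference I am after, here is a minimal sketch in plain
Java. It does not use any Lucene classes: the class and method names are
hypothetical, and the dash-splitting method only approximates the behaviour
described above rather than the standard tokenizer's real grammar.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenSplitDemo {

    // Approximates the behaviour I am seeing: break on whitespace AND dashes.
    public static List<String> splitOnWhitespaceAndDashes(String text) {
        return split(text, "[\\s-]+");
    }

    // The behaviour I want: break on whitespace only, so dashes stay in the token.
    public static List<String> splitOnWhitespace(String text) {
        return split(text, "\\s+");
    }

    // Shared helper: split on the given regex and drop empty pieces.
    private static List<String> split(String text, String regex) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.trim().split(regex)) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "See form 310N-P-Q for details";
        // prints [See, form, 310N, P, Q, for, details]
        System.out.println(splitOnWhitespaceAndDashes(text));
        // prints [See, form, 310N-P-Q, for, details]
        System.out.println(splitOnWhitespace(text));
    }
}
```

With the first method a search can only match the fragments 310N, P, or Q;
with the second, 310N-P-Q survives as a single searchable token.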
I know how to write a new tokenizer. I would like hints on how to
install it and get my indexing system to use it. I don't want to
modify the standard .jar file. What I think I want to do is set up my
indexing operation to use the WhitespaceTokenizer instead of the normal
one, but I am unsure how to do this.
I know that the IndexTask has a setAnalyzer method. The document
formats are rather complicated and I need special code to isolate the
text strings which should be indexed. My file analyzer isolates the
string I want to index, then does:
    doc.add(new Field(DocFormatters.CONTENT_FIELD, <string from the file>,
                      Field.Store.YES, Field.Index.TOKENIZED));
I suspect that my issue is getting the Field constructor to use a
different tokenizer, but I do not see how to tell it which one to use.
Can anyone help?
Thanks.
Bill Taylor