On Aug 29, 2006, at 1:46 PM, Erick Erickson wrote:

I'm in a real rush here, so pardon my brevity, but... one of the
constructors for IndexWriter takes an Analyzer as a parameter, which can be a
PerFieldAnalyzerWrapper. That, if I understand your issue, should fix you
right up.

That almost worked. I can't use a per-field analyzer because I have to process the content fields of all documents. I built a custom analyzer which extended StandardAnalyzer and replaced the tokenStream method with a new one that used WhitespaceTokenizer instead of StandardTokenizer. This meant that my document IDs were no longer split, but I lost the conversion of acronyms such as w.o. to wo, and the like.
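The trade-off can be shown without Lucene at all. Below is a stdlib-only sketch (not Lucene's actual tokenizer code) where the two split methods stand in for StandardTokenizer-like and WhitespaceTokenizer-like behavior:

```java
import java.util.Arrays;
import java.util.List;

public class TokenizerSketch {
    // StandardTokenizer-like behavior (much simplified): break on whitespace
    // AND dashes, so "310N-P-Q" shatters into three tokens.
    static List<String> punctuationSplit(String text) {
        return Arrays.asList(text.split("[\\s\\-]+"));
    }

    // WhitespaceTokenizer-like behavior: break on whitespace only, so
    // "310N-P-Q" survives as one token -- but "w.o." keeps its dots,
    // since no filter strips them.
    static List<String> whitespaceSplit(String text) {
        return Arrays.asList(text.split("\\s+"));
    }

    public static void main(String[] args) {
        String text = "see 310N-P-Q and w.o.";
        System.out.println(punctuationSplit(text));
        System.out.println(whitespaceSplit(text));
    }
}
```

Each approach fixes one problem and reintroduces the other, which is why a modified NUM production is the cleaner route.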

So what I need to do is make a new tokenizer based on StandardTokenizer, except that the NUM production on line 83 of StandardTokenizer.jj should be

| <NUM: (<ALPHANUM> (<P> <ALPHANUM>)+ | <ALPHANUM>) >

so that a serial number need not have a digit in every other segment, and a series of letters and digits without special characters such as a dash will be treated as a single word.
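The proposed production can be sanity-checked as a plain Java regex before touching the grammar. This is only an approximation, assuming ALPHANUM is roughly a run of letters and digits and P is the punctuation set ("_", "-", "/", ".", ","):

```java
import java.util.regex.Pattern;

public class NumProduction {
    // Rough regex equivalent of the proposed production
    //   <NUM: (<ALPHANUM> (<P> <ALPHANUM>)+ | <ALPHANUM>) >
    // with ALPHANUM ~ [A-Za-z0-9]+ and P ~ one of "_-/.,".
    static final Pattern NUM =
        Pattern.compile("[A-Za-z0-9]+([_\\-/.,][A-Za-z0-9]+)+|[A-Za-z0-9]+");

    // True if the whole string would be kept as one token.
    static boolean isSingleToken(String s) {
        return NUM.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isSingleToken("310N-P-Q")); // serial number, no digit required per segment
        System.out.println(isSingleToken("310NPQ"));   // plain alphanumeric run
    }
}
```

Note the key change from the stock NUM production: no segment is required to contain a digit, so "310N-P-Q" qualifies even though P and Q are pure letters.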

Questions:

1) If I change the .jj file in this way, how do I run JavaCC to make a new tokenizer? The JavaCC documentation says that JavaCC generates a number of output files; I think that I only need the tokenizer code.
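For what it's worth, the usual JavaCC workflow is just to point the tool at the grammar; the exact set of generated files below is the typical one and may vary with grammar options:

```shell
# Run JavaCC on the modified grammar (the javacc script, or the
# JavaCC jar, must be on the PATH/classpath):
javacc StandardTokenizer.jj

# JavaCC typically emits a parser class plus tokenizer support classes.
# For tokenizing alone, the token manager and its helpers are the
# interesting output (names follow the grammar's parser name):
#   StandardTokenizerTokenManager.java, StandardTokenizerConstants.java,
#   Token.java, TokenMgrError.java, and a char-stream class.
javac *.java
```

You can't easily suppress the unused parser files, but there is no harm in compiling them alongside the token manager.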

2) I suppose I have to tell the query parser to parse queries in the same way, is that right?

The reason I think so is that Luke shows words such as w.o. in the index which the query parser can't find. I suspect I have to use the same analyzer on both, right?
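That mismatch is easy to demonstrate with a stdlib-only sketch. Here `analyze` is just a stand-in for whatever the real analyzer does (lowercasing and dot-stripping, as the thread describes for acronyms like w.o.); the point is that a raw query term never matches an analyzed index term:

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class SameAnalyzer {
    // Stand-in for an analyzer: lowercase and strip dots, mimicking the
    // acronym conversion described above (w.o. -> wo).
    static String analyze(String term) {
        return term.toLowerCase(Locale.ROOT).replace(".", "");
    }

    public static void main(String[] args) {
        Set<String> index = new HashSet<>();
        index.add(analyze("w.o."));  // indexed as "wo"

        System.out.println(index.contains("w.o."));           // raw query term: miss
        System.out.println(index.contains(analyze("w.o."))); // analyzed query term: hit
    }
}
```

So yes: whatever analyzer builds the index has to be handed to the query parser as well, or the two sides produce different terms for the same text.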

On 8/29/06, Bill Taylor <[EMAIL PROTECTED]> wrote:

I am indexing documents which are filled with government jargon.  As
one would expect, the standard tokenizer has problems with
governmenteese.

In particular, the documents use words such as 310N-P-Q as references
to other documents.  The standard tokenizer breaks this "word" at the
dashes so that I can find P or Q but not the entire token.

I know how to write a new tokenizer.  I would like hints on how to
install it and get my indexing system to use it.  I don't want to
modify the standard .jar file.  What I think I want to do is set up my
indexing operation to use the WhitespaceTokenizer instead of the normal
one, but I am unsure how to do this.

I know that the IndexTask has a setAnalyzer method.  The document
formats are rather complicated and I need special code to isolate the
text strings which should be indexed.   My file analyzer isolates the
string I want to index, then does

doc.add(new Field(DocFormatters.CONTENT_FIELD, <string from the file>,
Field.Store.YES, Field.Index.TOKENIZED));

I suspect that my issue is getting the Field constructor to use a
different tokenizer.  Can anyone help?

Thanks.

Bill Taylor


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

