On Aug 29, 2006, at 1:46 PM, Erick Erickson wrote:

I'm in a real rush here, so pardon my brevity, but... one of the
constructors for IndexWriter takes an Analyzer as a parameter, which can be a
PerFieldAnalyzerWrapper. That, if I understand your issue, should fix you
right up.

That almost worked. I can't use a per-field analyzer because I have to process the content fields of all documents. I built a custom analyzer which extended StandardAnalyzer and replaced the tokenStream method with a new one that used WhitespaceTokenizer instead of StandardTokenizer. This meant that my document IDs were no longer split, but I lost the conversion of acronyms such as w.o. to wo, and the like.
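The trade-off can be shown without Lucene at all. Below is a stdlib-only sketch (not Lucene's actual tokenizer code) where the two split methods stand in for StandardTokenizer-like and WhitespaceTokenizer-like behavior:

```java
import java.util.Arrays;
import java.util.List;

public class TokenizerSketch {
    // StandardTokenizer-like behavior (much simplified): break on whitespace
    // AND dashes, so "310N-P-Q" shatters into three tokens.
    static List<String> punctuationSplit(String text) {
        return Arrays.asList(text.split("[\\s\\-]+"));
    }

    // WhitespaceTokenizer-like behavior: break on whitespace only, so
    // "310N-P-Q" survives as one token -- but "w.o." keeps its dots,
    // since no filter strips them.
    static List<String> whitespaceSplit(String text) {
        return Arrays.asList(text.split("\\s+"));
    }

    public static void main(String[] args) {
        String text = "see 310N-P-Q and w.o.";
        System.out.println(punctuationSplit(text));
        System.out.println(whitespaceSplit(text));
    }
}
```

Each approach fixes one problem and reintroduces the other, which is why a modified NUM production is the cleaner route.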

So what I need to do is make a new tokenizer based on StandardTokenizer, except that the NUM production on line 83 of StandardTokenizer.jj should be

| <NUM: (<ALPHANUM> (<P> <ALPHANUM>)+ | <ALPHANUM>) >

so that a serial number need not have a digit in every other segment, and a series of letters and digits without special characters such as a dash will be treated as a single word.
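The proposed production can be sanity-checked as a plain Java regex before touching the grammar. This is only an approximation, assuming ALPHANUM is roughly a run of letters and digits and P is the punctuation set ("_", "-", "/", ".", ","):

```java
import java.util.regex.Pattern;

public class NumProduction {
    // Rough regex equivalent of the proposed production
    //   <NUM: (<ALPHANUM> (<P> <ALPHANUM>)+ | <ALPHANUM>) >
    // with ALPHANUM ~ [A-Za-z0-9]+ and P ~ one of "_-/.,".
    static final Pattern NUM =
        Pattern.compile("[A-Za-z0-9]+([_\\-/.,][A-Za-z0-9]+)+|[A-Za-z0-9]+");

    // True if the whole string would be kept as one token.
    static boolean isSingleToken(String s) {
        return NUM.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isSingleToken("310N-P-Q")); // serial number, no digit required per segment
        System.out.println(isSingleToken("310NPQ"));   // plain alphanumeric run
    }
}
```

Note the key change from the stock NUM production: no segment is required to contain a digit, so "310N-P-Q" qualifies even though P and Q are pure letters.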

Questions:

1) If I change the .jj file in this way, how do I run JavaCC to make a new tokenizer? The JavaCC documentation says that JavaCC generates a number of output files; I think that I only need the tokenizer code.
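For what it's worth, the usual JavaCC workflow is just to point the tool at the grammar; the exact set of generated files below is the typical one and may vary with grammar options:

```shell
# Run JavaCC on the modified grammar (the javacc script, or the
# JavaCC jar, must be on the PATH/classpath):
javacc StandardTokenizer.jj

# JavaCC typically emits a parser class plus tokenizer support classes.
# For tokenizing alone, the token manager and its helpers are the
# interesting output (names follow the grammar's parser name):
#   StandardTokenizerTokenManager.java, StandardTokenizerConstants.java,
#   Token.java, TokenMgrError.java, and a char-stream class.
javac *.java
```

You can't easily suppress the unused parser files, but there is no harm in compiling them alongside the token manager.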

2) I suppose I have to tell the query parser to parse queries in the same way, is that right?

The reason I think so is that Luke shows words such as w.o. in the index which the query parser can't find. I suspect I have to use the same analyzer on both, right?
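That mismatch is easy to demonstrate with a stdlib-only sketch. Here `analyze` is just a stand-in for whatever the real analyzer does (lowercasing and dot-stripping, as the thread describes for acronyms like w.o.); the point is that a raw query term never matches an analyzed index term:

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class SameAnalyzer {
    // Stand-in for an analyzer: lowercase and strip dots, mimicking the
    // acronym conversion described above (w.o. -> wo).
    static String analyze(String term) {
        return term.toLowerCase(Locale.ROOT).replace(".", "");
    }

    public static void main(String[] args) {
        Set<String> index = new HashSet<>();
        index.add(analyze("w.o."));  // indexed as "wo"

        System.out.println(index.contains("w.o."));           // raw query term: miss
        System.out.println(index.contains(analyze("w.o."))); // analyzed query term: hit
    }
}
```

So yes: whatever analyzer builds the index has to be handed to the query parser as well, or the two sides produce different terms for the same text.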

On 8/29/06, Bill Taylor <[EMAIL PROTECTED]> wrote:

I am indexing documents which are filled with government jargon.  As
one would expect, the standard tokenizer has problems with
governmenteese.

In particular, the documents use words such as 310N-P-Q as references
to other documents.  The standard tokenizer breaks this "word" at the
dashes so that I can find P or Q but not the entire token.

I know how to write a new tokenizer.  I would like hints on how to
install it and get my indexing system to use it.  I don't want to
modify the standard .jar file.  What I think I want to do is set up my
indexing operation to use the WhitespaceTokenizer instead of the normal
one, but I am unsure how to do this.

I know that the IndexTask has a setAnalyzer method.  The document
formats are rather complicated and I need special code to isolate the
text strings which should be indexed.   My file analyzer isolates the
string I want to index, then does

doc.add(new Field(DocFormatters.CONTENT_FIELD, <string from the file>,
Field.Store.YES, Field.Index.TOKENIZED));

I suspect that my issue is getting the Field constructor to use a
different tokenizer.  Can anyone help?

Thanks.

Bill Taylor


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

