On Aug 29, 2006, at 1:46 PM, Erick Erickson wrote:
I'm in a real rush here, so pardon my brevity, but..... one of the
constructors for IndexWriter takes an Analyzer as a parameter, which
can be
a PerFieldAnalyzerWrapper. That, if I understand your issue, should
fix you
right up.
that almost worked. I can't use a per Field analyzer because I have to
process the content fields of all documents. I built a custom analyzer
which extended the Standard Analyzer and replaced the tokenStream
method with a new one which used WhitespaceTokenizer instead of
StandardTokenizer. This meant that my document IDs were not split, but
I lost the conversion of acronyms such as w.o. to wo and the like
So what I need to do is to make a new Tokenizer based on the
StandardTokenizer except that a NUM on line 83 of StandardTokenizer.jj
should be
| NUM: (<ALPHANUM> (<P> <ALPHANUM>) + | <ALPHANUM>) >
so that a serial number need not have a digit in every other segment
and a series of letters and digits without special characters such as a
dash will be treated as a single word.
Questions:
1) If I change the .jj file in this way, how to I run javaCC to make a
new tokenizer? The JavaCC documentation says that JavaCC generates a
number of output files; I think that I only need the tokenizer code.
2) I suppose i have to tell the query parser to parse queries in the
same way, is that right?
The reason I think so is that Luke says I have words such as w.o. in
the index which the query parser can't find. I suspect I have to use
the same Analyzer on both, right?
On 8/29/06, Bill Taylor <[EMAIL PROTECTED]> wrote:
I am indexing documents which are filled with government jargon. As
one would expect, the standard tokenizer has problems with
governmenteese.
In particular, the documents use words such as 310N-P-Q as references
to other documents. The standard tokenizer breaks this "word" at the
dashes so that I can find P or Q but not the entire token.
I know how to write a new tokenizer. I would like hints on how to
install it and get my indexing system to use it. I don't want to
modify the standard .jar file. What I think I want to do is set up my
indexing operation to use the WhitespaceTokenizer instead of the
normal
one, but I am unsure how to do this.
I know that the IndexTask has a setAnalyzer method. The document
formats are rather complicated and I need special code to isolate the
text strings which should be indexed. My file analyzer isolates the
string I want to index, then does
doc.add(new Field(DocFormatters.CONTENT_FIELD, <string from the file>,
Field.Store.YES, Field.index.TOKENIZED));
I suspect that my issue is getting the Field constructor to use a
different tokenizer. Can anyone help?
Thanks.
Bill Taylor
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]