Re: Installing a custom tokenizer

Bill Taylor Tue, 29 Aug 2006 15:48:58 -0700

I have copied Lucene's StandardTokenizer.jj into my directory, renamedit, and did a global change of the names to my class name,LogTokenizer.

The issue is that the generated LogTokenizer.java does not compile for2 reasons:

1) in the constructor, this(new FastCharStream(reader)); fails becausethere is no such constructor in the parent class. I commented it out.

2) I get an error on the next() method which throws ParseException andIO Exception. The message is Exception ParseException is notcompatible with throws clause in TokenStream.next(). As far as I cansee, the exceptions are OK.

Since all of this is generated code, my feelings are a bit hurt. DidLucene use an older version of JavaCC? I am using javacc-4.0


On Aug 29, 2006, at 4:57 PM, Erick Erickson wrote:

Tucked away in the contrib section of Lucene (I'm using 2.0) thereis....


org.apache.lucene.index.memory.PatternAnalyzer

which takes a regular expression as and tokenizes with it. Would thathelp?Word of warning... the regex determines what is NOT a token, not whatIS a

token (as I remember), which threw me for a bit.

Don't know if this is really useful, but it might work for you withoutas

much work...

Best
[EMAIL PROTECTED]'mNowBeyondMyCompetence.WhyDoTheyStillEmployMeHere?

On 8/29/06, Bill Taylor <[EMAIL PROTECTED]> wrote:

On Aug 29, 2006, at 2:47 PM, Chris Hostetter wrote:

>
> : Have a look at PerFieldAnalyzerWrapper:
>
> :
> http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/
> PerFieldAnalyzerWrapper.html
>
> ...which can be specified in the constructors for IndexWriter and
> QueryParser.

As I understand it, this allows me to specify a different analyzer for
each field name.  My problem is that the standard analyzer will not
work for my content field and I need to define a new one.  I need to
make a modification to the StandardTokenizer so that a number does not
need to have a digit in every other segment of a part number.

For example, the StandardTokenizer breaks aa-bb-2 on the - between aa
and bb because it demands that every other string between a - have a
digit.

I need to modify the .jj file for the Standard Tokenizer and get a new
one, but I am confused by the javaCC documentation and do not know how
to run it to get what I need.

Thanks for the help.

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Installing a custom tokenizer

Reply via email to