Re: Changing the Punctuation definition for StandardAnalyzer

tareque Thu, 20 Dec 2007 13:32:50 -0800

Karl,

I should have mentioned earlier that I am using Lucene 1.9.1.


In fact, I had already located the grammar in StandardTokenizer.jj (I just
wasn't sure whether that was the one you were talking about) and had
commented out the EMAIL entries in all of the following files:

StandardTokenizer.java
StandardTokenizer.jj
StandardTokenizerConstants.java

But evidently the tokenizer was still expecting the email addresses to
match one of the other TOKEN types, and since they matched none of them,
it threw a ParseException.

What puzzles me now is this: I don't see the '@' sign (Unicode value 0040)
included in "LETTER" or any other definition, so why is it not splitting
the words? It certainly isn't, which is why the tokenizer expects the
email address to be defined as a TYPE. My understanding, from looking at
the code, is that any character not defined in the grammar would act as a
splitter, since it does not contribute to any TOKEN definition.
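
To make that expectation concrete, here is a rough sketch of the behaviour
I had in mind. It is a toy model only, not Lucene's StandardTokenizer: the
class SplitSketch and its methods are made up for illustration, and
Character.isLetterOrDigit is just a stand-in for the grammar's LETTER and
DIGIT definitions.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {
    // Toy stand-in for the grammar's LETTER/DIGIT definitions (assumption,
    // not the actual character ranges used by StandardTokenizer.jj).
    static boolean isTokenChar(char c) {
        return Character.isLetterOrDigit(c);
    }

    // Any character that belongs to no token definition ends the current
    // token and is otherwise dropped, i.e. it acts as a splitter.
    static List<String> split(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isTokenChar(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // '@' and '.' are not token characters here, so they split the input.
        System.out.println(split("user@example.com")); // prints [user, example, com]
    }
}
```

With this model "user@example.com" comes out as three separate tokens,
which is what I expected once the EMAIL entries were removed. What I see
from the real tokenizer is different.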

Please let me know what I am missing.

Thanks
Tareque

>
> On 20 Dec 2007, at 20:21, [EMAIL PROTECTED] wrote:
>
>> I would rather modify the lexer grammar, but where exactly is it
>> defined? After a quick look, it seems that
>> StandardTokenizerTokenManager.java may be where it is done.
>
> http://svn.apache.org/repos/asf/lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex
>
> It can be generated with the Ant build.
>
> --
> karl
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


