Karl, I should have mentioned this before: I am using Lucene 1.9.1.
In fact, I had previously located the grammar in StandardTokenizer.jj (I just wasn't sure whether that was the one you were talking about) and had commented out the EMAIL entries in all of the following files:

    StandardTokenizer.java
    StandardTokenizer.jj
    StandardTokenizerConstants.java

But evidently the tokenizer still expected the email addresses to match one of the other TOKEN types, and since they matched none of them, it threw a ParseException.

What puzzles me is this: I don't see the '@' sign (U+0040) included in "LETTER" or in any other definition, so why is it not splitting the words? It certainly isn't, which is why the tokenizer expects the email address to be defined as a TYPE. My understanding, from looking at the code, is that any character not defined in the grammar would act as a splitter, since it does not contribute to any TOKEN definition.

Please let me know what I am missing.

Thanks
Tareque

> On 20 Dec 2007, at 20:21, [EMAIL PROTECTED] wrote:
>
>> I would rather like to modify the lexer grammar. But exactly where it is
>> defined. After having a quick look, seems like
>> StandardTokenizerTokenManager.java may be where it is being done.
>
> http://svn.apache.org/repos/asf/lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex
>
> It can be generated with the Ant build.
>
> --
> karl

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]