OK, I've followed your advice and commented out some lines in the NUM section. I just tried it, and it now works as expected, thanks a lot. It looks scary, but it isn't that bad.
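
For anyone finding this thread in the archives, here is a rough sketch of what switching off the NUM rule does to the token stream. It is only an illustration in plain java.util.regex, not the JavaCC grammar itself: the class name NumRuleSketch, the tokenize helper and the Unicode character classes standing in for <ALPHANUM>, <P> and <HAS_DIGIT> are assumptions made for the example, and only the first two alternatives of <NUM> are modelled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration: plain java.util.regex, NOT the real StandardTokenizer.jj.
class NumRuleSketch {
    // Simplified stand-ins (assumptions) for the grammar's <ALPHANUM>, <P>, <HAS_DIGIT>.
    static final String ALPHANUM = "[\\p{L}\\p{Nd}]+";
    static final String P = "[_\\-/.,]";
    static final String HAS_DIGIT = "[\\p{L}\\p{Nd}]*\\p{Nd}[\\p{L}\\p{Nd}]*";

    // Only the first two <NUM> alternatives; the repeated (...)+ forms are left out.
    static final String NUM =
        ALPHANUM + P + HAS_DIGIT + "|" + HAS_DIGIT + P + ALPHANUM;

    // With the rule enabled, NUM is tried first and ALPHANUM is the fallback.
    // Note: java.util.regex picks the first matching alternative, whereas
    // JavaCC picks the longest match, so this only approximates the grammar.
    static List<String> tokenize(String text, boolean numRuleEnabled) {
        String regex = numRuleEnabled ? "(?:" + NUM + ")|" + ALPHANUM : ALPHANUM;
        Matcher m = Pattern.compile(regex).matcher(text);
        List<String> tokens = new ArrayList<>();
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "build 2007.05 ok";
        System.out.println(tokenize(text, true));   // [build, 2007.05, ok]
        System.out.println(tokenize(text, false));  // [build, 2007, 05, ok]
    }
}
```

With the NUM stand-in active, "2007.05" survives as one token; with it switched off, the text is split at the punctuation character even though no whitespace follows it, which is the behaviour I was after.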

Thanks!

Regards,
Michael

> -----Original Message-----
> From: Steven Rowe [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, May 29, 2007 19:54
> To: java-user@lucene.apache.org
> Subject: Re: Modifying StandardAnalyzer so that it also splits words
> after punctuation characters that are not followed by whitespace
>
> Hi Michael,
>
> Michael Böckling wrote:
> > Hi folks!
> >
> > The topic says it all: I want to modify the StandardAnalyzer so that it also
> > splits words after punctuation characters (.,: etc.) that are NOT followed
> > by a whitespace character, in addition to punctuation characters that ARE
> > followed by whitespace.
> >
> > Of course I've looked at StandardTokenizer.jj, but I don't quite get it. The
> > recursive nature of the grammar bends my mind.
> >
> > Can someone smarter than me help here?
>
> Um, that probably disqualifies me, but anyway...
>
> There are several regexes in StandardTokenizer.jj that generate tokens
> containing punctuation. You should be able to selectively comment them
> out to achieve what you want:
>
> 1. Acronyms:
>
> | <ACRONYM: <ALPHA> "." (<ALPHA> ".")+ >
>
> 2. Company names:
>
> | <COMPANY: <ALPHA> ("&"|"@") <ALPHA> >
>
> 3. Email addresses:
>
> | <EMAIL: <ALPHANUM> (("."|"-"|"_") <ALPHANUM>)* "@" <ALPHANUM>
>     (("."|"-") <ALPHANUM>)+ >
>
> 4. Hostnames:
>
> | <HOST: <ALPHANUM> ("." <ALPHANUM>)+ >
>
> 5. The <NUM>, <P> and <HAS_DIGIT> regexes, for IP addresses, etc.:
>
> | <NUM: (<ALPHANUM> <P> <HAS_DIGIT>
>        | <HAS_DIGIT> <P> <ALPHANUM>
>        | <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
>        | <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
>        | <ALPHANUM> <P> <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
>        | <HAS_DIGIT> <P> <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
>        )
>   >
>
> | <#P: ("_"|"-"|"/"|"."|",") >
>
> | <#HAS_DIGIT: // at least one digit
>     (<LETTER>|<DIGIT>)*
>     <DIGIT>
>     (<LETTER>|<DIGIT>)*
>   >
>
> Steve
>
> --
> Steve Rowe
> Center for Natural Language Processing
> http://www.cnlp.org/tech/lucene.asp
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]