Re: Modifying StandardAnalyzer so that it also splits words after punctuation characters that are not followed by whitespace

Michael Böckling Wed, 30 May 2007 02:58:19 -0700

OK, I followed your advice and commented out some lines in the NUM
section. I just tried it, and it now works as expected and does exactly
what I wanted, thanks a lot. The grammar looks scary, but it isn't that bad.
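
As a rough illustration of the target behaviour (every punctuation character
ends a token, whether or not whitespace follows), here is a minimal sketch in
plain Java. It is not StandardTokenizer and does not reproduce the grammar: it
also splits on apostrophes and knows nothing about the ACRONYM, COMPANY, EMAIL
or HOST rules. The class name PunctuationSplitSketch and the method tokenize
are made up for illustration, and the lowercasing is only an assumption meant
to mimic what StandardAnalyzer does after tokenizing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-alone sketch; not part of Lucene.
public class PunctuationSplitSketch {

    // A token is a maximal run of Unicode letters and digits.
    // Anything else (".", ",", ":", "&", "@", whitespace, ...) ends the token.
    private static final Pattern WORD = Pattern.compile("[\\p{L}\\p{N}]+");

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            // Lowercase each token (assumed here to mirror StandardAnalyzer).
            tokens.add(m.group().toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Punctuation without trailing whitespace still splits the words.
        System.out.println(tokenize("foo.bar,baz: 192.168.0.1 and AT&T"));
        // prints [foo, bar, baz, 192, 168, 0, 1, and, at, t]
    }
}
```

Comparing this output with what the modified analyzer produces for the same
input is a quick way to spot grammar rules that still keep punctuation inside
a token.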


Thanks!

Regards,
Michael



> -----Original Message-----
> From: Steven Rowe [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, 29 May 2007 19:54
> To: java-user@lucene.apache.org
> Subject: Re: Modifying StandardAnalyzer so that it also splits words
> after punctuation characters that are not followed by whitespace
> 
> 
> Hi Michael,
> 
> Michael Böckling wrote:
> > Hi folks!
> > 
> > The topic says it all: I want to modify the StandardAnalyzer so that it
> > also splits words after punctuation characters (.,: etc.) that are NOT
> > followed by a whitespace character, in addition to punctuation
> > characters that ARE followed by whitespace.
> > 
> > Of course I've looked at StandardTokenizer.jj, but I don't quite get it.
> > The recursive nature of the grammar bends my mind.
> > 
> > Can someone smarter than me help here?
> 
> Um, that probably disqualifies me, but anyway...
> 
> There are several regexes in StandardTokenizer.jj that generate tokens
> containing punctuation.  You should be able to selectively comment them
> out to achieve what you want:
> 
> 1. Acronyms:
> 
>   | <ACRONYM: <ALPHA> "." (<ALPHA> ".")+ >
> 
> 2. Company names:
> 
>   | <COMPANY: <ALPHA> ("&"|"@") <ALPHA> >
> 
> 3. Email addresses:
> 
>   | <EMAIL: <ALPHANUM> (("."|"-"|"_") <ALPHANUM>)* "@" <ALPHANUM>
>     (("."|"-") <ALPHANUM>)+ >
> 
> 4. Hostnames:
> 
>   | <HOST: <ALPHANUM> ("." <ALPHANUM>)+ >
> 
> 5. The <NUM>, <P> and <HAS_DIGIT> regexes, for IP addresses, etc.:
> 
>   | <NUM: (<ALPHANUM> <P> <HAS_DIGIT>
>          | <HAS_DIGIT> <P> <ALPHANUM>
>          | <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
>          | <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
>          | <ALPHANUM> <P> <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
>          | <HAS_DIGIT> <P> <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
>           )
>     >
>   | <#P: ("_"|"-"|"/"|"."|",") >
>   | <#HAS_DIGIT:                // at least one digit
>     (<LETTER>|<DIGIT>)*
>     <DIGIT>
>     (<LETTER>|<DIGIT>)*
>     >
> 
> 
> Steve
> 
> -- 
> Steve Rowe
> Center for Natural Language Processing
> http://www.cnlp.org/tech/lucene.asp
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
