Hi Michael,

Michael Böckling wrote:
> Hi folks!
> 
> The topic says it all: I want to modify the StandardAnalyzer so that it also
> splits words after punctuation characters (.,: etc.) that are NOT followed
> by a whitespace character, in addition to punctuation characters that ARE
> followed by whitespace.
> 
> Of course I've looked at StandardTokenizer.jj, but I don't quite get it. The
> recursive nature of the grammar bends my mind.
> 
> Can someone smarter than me help here?

Um, that probably disqualifies me, but anyway...

There are several regexes in StandardTokenizer.jj that generate tokens
containing punctuation.  You should be able to selectively comment them
out, and then regenerate the tokenizer from the grammar with JavaCC, to
achieve what you want:

1. Acronyms:

  | <ACRONYM: <ALPHA> "." (<ALPHA> ".")+ >

2. Company names:

  | <COMPANY: <ALPHA> ("&"|"@") <ALPHA> >

3. Email addresses:

  | <EMAIL: <ALPHANUM> (("."|"-"|"_") <ALPHANUM>)* "@" <ALPHANUM>
    (("."|"-") <ALPHANUM>)+ >

4. Hostnames:

  | <HOST: <ALPHANUM> ("." <ALPHANUM>)+ >

5. The <NUM>, <P> and <HAS_DIGIT> regexes, for IP addresses, etc.:

  | <NUM: (<ALPHANUM> <P> <HAS_DIGIT>
         | <HAS_DIGIT> <P> <ALPHANUM>
         | <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
         | <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
         | <ALPHANUM> <P> <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
         | <HAS_DIGIT> <P> <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
          )
    >
  | <#P: ("_"|"-"|"/"|"."|",") >
  | <#HAS_DIGIT:                  // at least one digit
    (<LETTER>|<DIGIT>)*
    <DIGIT>
    (<LETTER>|<DIGIT>)*
    >
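
To get a feel for what dropping one of these rules does, here is a rough
sketch in plain Java that imitates the <HOST> rule with java.util.regex.
It is not Lucene code: the class name PunctuationSplitSketch, the tokenize
method and the ASCII-only definition of ALPHANUM are assumptions made for
illustration.  Also, JavaCC picks the longest match, whereas the regex
below simply tries the HOST alternative first, so treat it as an
approximation only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene.
class PunctuationSplitSketch {
    // Assumption: ALPHANUM is limited to ASCII letters and digits here.
    static final String ALPHANUM = "[A-Za-z0-9]+";

    // Rough regex analogue of <HOST: <ALPHANUM> ("." <ALPHANUM>)+ >
    static final String HOST = ALPHANUM + "(?:\\." + ALPHANUM + ")+";

    // keepHostRule == false imitates commenting the HOST rule out.
    static List<String> tokenize(String text, boolean keepHostRule) {
        String rules = keepHostRule ? HOST + "|" + ALPHANUM : ALPHANUM;
        Matcher m = Pattern.compile(rules).matcher(text);
        List<String> tokens = new ArrayList<>();
        while (m.find()) {
            tokens.add(m.group()); // anything unmatched acts as a separator
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "see www.example.com,now";
        System.out.println(tokenize(text, true));
        System.out.println(tokenize(text, false));
    }
}
```

With the HOST alternative present, "www.example.com" stays a single token;
without it, the dots act as separators and the name is split into its
parts, which is the behaviour asked for above.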


Steve

-- 
Steve Rowe
Center for Natural Language Processing
http://www.cnlp.org/tech/lucene.asp
