On 19/10/2011 15:17, Steven A Rowe wrote:
Hi Paul,

What version of Lucene are you using?  The JFlex spec you quote below looks 
pre-v3.1?

Yes, we copied a version of StandardTokenizer from 2.4 to make some changes. We are actually on 3.1 now, but haven't spent any time looking at the new tokenizer JFlex code, which appears better.

Anyway, I finally have a proof of concept that I think will work this time.

I realised that if someone enters 'fred!!!' I don't want to just match 'fred', because then another token would be created for '!!!'. So I've created separate rules for matching:

fred     (ALPHANUM)
fred!!!  (EMAIL)
!!!      (COMPANY)

I modified the JFlex spec to catch these:

// basic word: a sequence of digits & letters (includes Thai to enable ThaiAnalyzer to function)
ALPHANUM   = ({LETTER}|{THAI}|[:digit:])+

// control/punctuation chars
CONTROLANDPUNC     =  ("!"|"*"|"^"|"."|"@"|"%"|"♠"|"\"")

COMPANY    =  ({CONTROLANDPUNC})+


//MUST CONTAIN Alphanumeric and Punctuation Characters
EMAIL = ({ALPHANUM}|{CONTROLANDPUNC})*{CONTROLANDPUNC}{ALPHANUM}({ALPHANUM}|{CONTROLANDPUNC})* | ({ALPHANUM}|{CONTROLANDPUNC})*{ALPHANUM}{CONTROLANDPUNC}({ALPHANUM}|{CONTROLANDPUNC})*

%%

{EMAIL}    { return EMAIL; }
{ALPHANUM} { return ALPHANUM; }
{COMPANY}  { return COMPANY; }
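To make the intent of the three rules concrete, here is a rough plain-Java sketch of how a single whitespace-free chunk ends up classified. It is not the generated scanner: TokenTypeSketch and classify are made-up names, Character.isLetterOrDigit is only an approximation of {LETTER}|{THAI}|[:digit:], and the punctuation string is assumed to mirror CONTROLANDPUNC. It relies on the observation that any chunk containing both kinds of character necessarily has an alphanumeric run next to a punctuation char, which is what EMAIL requires, and that JFlex prefers the longest match.

```java
final class TokenTypeSketch {

    enum Type { ALPHANUM, EMAIL, COMPANY, NONE }

    // Assumption: same characters as CONTROLANDPUNC in the spec (\u2660 is the spade).
    private static final String CONTROL_AND_PUNC = "!*^.@%\u2660\"";

    /** Classifies a chunk made up only of alphanumerics and/or listed punctuation. */
    static Type classify(String chunk) {
        boolean sawAlphanum = false;
        boolean sawPunc = false;
        for (int i = 0; i < chunk.length(); i++) {
            char c = chunk.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                sawAlphanum = true;
            } else if (CONTROL_AND_PUNC.indexOf(c) >= 0) {
                sawPunc = true;
            } else {
                return Type.NONE; // character outside both classes
            }
        }
        if (sawAlphanum && sawPunc) {
            return Type.EMAIL;    // mixed: the longest match wins over ALPHANUM
        }
        if (sawAlphanum) {
            return Type.ALPHANUM; // letters/digits only
        }
        if (sawPunc) {
            return Type.COMPANY;  // punctuation only
        }
        return Type.NONE;         // empty input
    }

    public static void main(String[] args) {
        System.out.println(classify("fred"));    // ALPHANUM
        System.out.println(classify("fred!!!")); // EMAIL
        System.out.println(classify("!!!"));     // COMPANY
    }
}
```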

Then I have a filter that looks for type=EMAIL and removes those punctuation chars:

    public final boolean incrementToken() throws java.io.IOException {
        if (!input.incrementToken()) {
            return false;
        }

        final char[] buffer = termAtt.buffer();
        final int bufferLength = termAtt.length();
        final String type = typeAtt.type();

        // Compare with equals() so the check does not depend on string identity
        if (EMAIL.equals(type)) { // remove control chars when they make up only part of the token
            int upto = 0;
            for (int i = 0; i < bufferLength; i++) {
                char c = buffer[i];
                if (c != '!') {
                    // keep the character; a '!' is dropped by not copying it
                    buffer[upto++] = c;
                }
            }
            termAtt.setLength(upto);
        }
        return true;
    }

I just need to improve the code to use a suitable list of control chars rather than hardcoding individual chars.
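One possible shape for that improvement, as a minimal sketch rather than the finished filter: keep the characters in one sorted array and test membership with a binary search. ControlCharStripper, isControlOrPunc and strip are hypothetical names, and the character list is assumed to be the same one used for CONTROLANDPUNC in the spec; the real list may well differ.

```java
import java.util.Arrays;

final class ControlCharStripper {

    // Assumption: same characters as CONTROLANDPUNC in the spec (\u2660 is the spade).
    // Sorted once at class load so Arrays.binarySearch can be used for lookups.
    private static final char[] CONTROL_AND_PUNC = sorted("!*^.@%\u2660\"");

    private static char[] sorted(String chars) {
        char[] result = chars.toCharArray();
        Arrays.sort(result);
        return result;
    }

    static boolean isControlOrPunc(char c) {
        return Arrays.binarySearch(CONTROL_AND_PUNC, c) >= 0;
    }

    /**
     * Compacts buffer[0..length) in place, dropping every listed character,
     * and returns the new length.
     */
    static int strip(char[] buffer, int length) {
        int upto = 0;
        for (int i = 0; i < length; i++) {
            char c = buffer[i];
            if (!isControlOrPunc(c)) {
                buffer[upto++] = c;
            }
        }
        return upto;
    }

    public static void main(String[] args) {
        char[] buffer = "fred!!!".toCharArray();
        int newLength = strip(buffer, buffer.length);
        System.out.println(new String(buffer, 0, newLength)); // prints "fred"
    }
}
```

Inside the filter, the loop in the EMAIL branch would then reduce to termAtt.setLength(ControlCharStripper.strip(buffer, bufferLength)).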

This solution seems the closest fit to Lucene.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
