Re: How do you see if a tokenstream has tokens without consuming the tokens ?

Paul Taylor Thu, 20 Oct 2011 01:00:03 -0700

On 19/10/2011 15:17, Steven A Rowe wrote:

Hi Paul,


What version of Lucene are you using?  The JFlex spec you quote below looks 
pre-v3.1?

Yes, we copied a version of StandardTokenizer from 2.4 to make some changes. We are actually on 3.1 now but haven't spent any time looking at the new tokenizer JFlex code, which appears better.


Anyway, I finally have a proof of concept that I think will work this time.

I realised that if someone enters 'fred!!!' I don't want to just match 'fred', because then another token will be created for '!!!'. So I've created separate rules for matching:


fred     (ALPHANUM)
fred!!!  (EMAIL)
!!!      (COMPANY)

I modified the JFlex spec to catch these:

// basic word: a sequence of digits & letters (includes Thai to enable ThaiAnalyzer to function)

ALPHANUM   = ({LETTER}|{THAI}|[:digit:])+

// 'CONTROLANDPUNC': control/punctuation chars
CONTROLANDPUNC     =  ("!"|"*"|"^"|"."|"@"|"%"|"♠"|"\"")

COMPANY    =  ({CONTROLANDPUNC})+


//MUST CONTAIN Alphanumeric and Punctuation Characters

EMAIL =({ALPHANUM}|{CONTROLANDPUNC})*{CONTROLANDPUNC}{ALPHANUM}({ALPHANUM}|{CONTROLANDPUNC})*|({ALPHANUM}|{CONTROLANDPUNC})*{ALPHANUM}{CONTROLANDPUNC}({ALPHANUM}|{CONTROLANDPUNC})*

%%

{EMAIL}    { return EMAIL; }
{ALPHANUM} { return ALPHANUM; }
{COMPANY}  { return COMPANY; }
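To sanity-check which type a whole token would get, here is a rough sketch of the three macros as java.util.regex patterns. It is only an approximation and not the JFlex-generated scanner: the class name TokenTypeSketch is made up for illustration, \p{L} and \p{Nd} are assumed stand-ins for {LETTER}|{THAI}|[:digit:], and it classifies one complete string rather than scanning a stream. JFlex itself picks the longest match and, on a tie, the rule listed first, which is why 'fred!!!' comes out as a single EMAIL token.

```java
import java.util.regex.Pattern;

final class TokenTypeSketch {
    // Assumed approximations of the JFlex macros (not the real LETTER/THAI definitions)
    private static final String ALNUM_CHAR = "[\\p{L}\\p{Nd}]";
    // Same characters as CONTROLANDPUNC; \u2660 is the spade character
    private static final String PUNC_CHAR = "[!*^.@%\u2660\"]";
    private static final String ANY = "(?:" + ALNUM_CHAR + "|" + PUNC_CHAR + ")*";

    private static final Pattern ALPHANUM = Pattern.compile(ALNUM_CHAR + "+");
    private static final Pattern COMPANY = Pattern.compile(PUNC_CHAR + "+");
    // Must contain an alphanumeric char next to a punctuation char, in either order
    private static final Pattern EMAIL = Pattern.compile(
            ANY + PUNC_CHAR + ALNUM_CHAR + ANY + "|" + ANY + ALNUM_CHAR + PUNC_CHAR + ANY);

    /** Returns the type name for a whole token, checking rules in the same order as the spec. */
    static String classify(String token) {
        if (EMAIL.matcher(token).matches()) return "EMAIL";
        if (ALPHANUM.matcher(token).matches()) return "ALPHANUM";
        if (COMPANY.matcher(token).matches()) return "COMPANY";
        return "NONE";
    }

    public static void main(String[] args) {
        for (String s : new String[] {"fred", "fred!!!", "!!!"}) {
            System.out.println(s + " -> " + classify(s));
        }
    }
}
```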

Then I have a filter that looks for type=EMAIL and removes those punctuation chars:


    public final boolean incrementToken() throws java.io.IOException {
        if (!input.incrementToken()) {
            return false;
        }

        final char[] buffer = termAtt.buffer();
        final int bufferLength = termAtt.length();
        final String type = typeAtt.type();

        // Remove control chars when they make up only part of the token.
        // equals() rather than == so the check does not rely on string interning.
        if (EMAIL.equals(type)) {
            int upto = 0;
            for (int i = 0; i < bufferLength; i++) {
                final char c = buffer[i];
                if (c != '!') {
                    // Keep the character; '!' is dropped by not copying it
                    buffer[upto++] = c;
                }
            }
            termAtt.setLength(upto);
        }
        return true;
    }

I just need to improve the code to use a suitable list of control chars rather than hardcoding individual chars.
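Something along these lines is what I have in mind for the list-based version. It is only a sketch with no Lucene dependency: ControlCharStripper, CONTROL_AND_PUNC and strip are names invented here, and the character list is assumed to mirror the CONTROLANDPUNC macro. In the filter, the returned length would be passed to termAtt.setLength().

```java
final class ControlCharStripper {
    // Assumed to mirror the CONTROLANDPUNC macro; \u2660 is the spade character
    private static final String CONTROL_AND_PUNC = "!*^.@%\u2660\"";

    static boolean isControlOrPunc(char c) {
        return CONTROL_AND_PUNC.indexOf(c) >= 0;
    }

    /**
     * Compacts buffer[0..length) in place, dropping every listed character,
     * and returns the new logical length.
     */
    static int strip(char[] buffer, int length) {
        int upto = 0;
        for (int i = 0; i < length; i++) {
            final char c = buffer[i];
            if (!isControlOrPunc(c)) {
                buffer[upto++] = c;
            }
        }
        return upto;
    }

    public static void main(String[] args) {
        char[] buf = "fred!!!".toCharArray();
        int len = strip(buf, buf.length);
        System.out.println(new String(buf, 0, len));
    }
}
```

A String lookup is enough for a list this short; a boolean table or a Set would be the obvious swap if the list grows.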


This solution seems the closest fit to Lucene.

