On 18/10/2011 05:19, Steven A Rowe wrote:
Hi Paul,
You could add a rule to the StandardTokenizer JFlex grammar to handle this
case, bypassing its other rules.
This seemed to be working. Just to test it out, I changed the EMAIL rule
to this
EMAIL = ("!"|"*"|"^"|"!"|"."|"@"|"%"|"♠"|"\"")+
And changed the order in which the tokens were checked:
%%
{ALPHANUM}      { return ALPHANUM; }
{APOSTROPHE}    { return APOSTROPHE; }
{ACRONYM}       { return ACRONYM; }
{COMPANY}       { return COMPANY; }
{HOST}          { return HOST; }
{NUM}           { return NUM; }
{CJ}            { return CJ; }
{ACRONYM_DEP}   { return ACRONYM_DEP; }
{EMAIL}         { return EMAIL; }

/** Ignore the rest */
. | {WHITESPACE}   { /* ignore */ }
So then if I passed '!!!' to the tokenizer, it kept it, which was exactly
what I wanted.
However, if I passed it 'fred!!!' it split it into two tokens,
'fred' and '!!!'
which is not what I wanted; I just wanted to get back
'fred'
I tried changing EMAIL to
EMAIL = ^("!"|"*"|"^"|"!"|"."|"@"|"%"|"♠"|"\"")+
but use of ^ and $ seems to be disallowed, so I can't see any way to do
what I want in the JFlex grammar. If that's the case, can I somehow drop
the second token in a subsequent filter?
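One shape such a filter could take (an untested sketch against the Lucene
3.x TokenFilter API; the class name PunctuationFallbackFilter is made up
here, and inputs are assumed short enough to buffer whole): collect the
stream once, then replay only the non-punctuation tokens unless nothing
else survived.

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.AttributeSource;

/** Drops punctuation-only tokens unless the stream holds nothing else. */
public final class PunctuationFallbackFilter extends TokenFilter {

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private Iterator<AttributeSource.State> replay;

    public PunctuationFallbackFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (replay == null) {
            // Buffer the whole stream once; inputs here are only a few words.
            List<AttributeSource.State> words = new ArrayList<AttributeSource.State>();
            List<AttributeSource.State> punct = new ArrayList<AttributeSource.State>();
            while (input.incrementToken()) {
                (isPunctuationOnly() ? punct : words).add(captureState());
            }
            // Keep the punctuation tokens only when no real words were found.
            replay = (words.isEmpty() ? punct : words).iterator();
        }
        if (replay.hasNext()) {
            restoreState(replay.next());
            return true;
        }
        return false;
    }

    private boolean isPunctuationOnly() {
        for (int i = 0; i < termAtt.length(); i++) {
            if (Character.isLetterOrDigit(termAtt.buffer()[i])) {
                return false;
            }
        }
        return true;
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        replay = null;
    }
}

With that in the chain, 'fred!!!' comes out as just 'fred', while a pure
'!!!' input still yields its token, so no second tokenization pass is needed.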
Paul
Another option is to create a char filter that substitutes PUNCT-EXCLAMATION
for exclamation points, PUNCT-PERIOD for periods, etc., but only when the
entire input consists exclusively of whitespace and punctuation. These symbols
would then be left intact by StandardTokenizer.
Steve
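One way to realize that idea (a rough, untested sketch, not from the
thread: it reuses the specialcharConvertMap and LuceneVersion constant from
Paul's code below, and slurp / isAllPunctuationOrWhitespace are hypothetical
helpers shown inline) is to read the short input up front and only then
decide whether to apply the mapping:

public TokenStream tokenStream(String fieldName, Reader reader) {
    try {
        String text = slurp(reader);
        Reader in = new StringReader(text);
        if (isAllPunctuationOrWhitespace(text)) {
            // Only now does the mapping rewrite "!" to "Exclamation" etc.,
            // so normal text is never polluted with the substitute words.
            in = new MappingCharFilter(specialcharConvertMap, in);
        }
        TokenStream ts = new StandardTokenizer(LuceneVersion.LUCENE_VERSION, in);
        return new LowerCaseFilter(ts);
    } catch (IOException ioe) {
        throw new RuntimeException(ioe);
    }
}

// Hypothetical helper: read the (short) input fully into a String.
private static String slurp(Reader reader) throws IOException {
    StringBuilder sb = new StringBuilder();
    int c;
    while ((c = reader.read()) != -1) {
        sb.append((char) c);
    }
    return sb.toString();
}

// Hypothetical helper: true when no character is a letter or digit.
private static boolean isAllPunctuationOrWhitespace(String s) {
    for (int i = 0; i < s.length(); i++) {
        if (Character.isLetterOrDigit(s.charAt(i))) {
            return false;
        }
    }
    return true;
}

Reading the input twice is cheap here because the metadata strings are only
a few words long.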
-----Original Message-----
From: Paul Taylor [mailto:paul_t...@fastmail.fm]
Sent: Monday, October 17, 2011 8:13 AM
To: 'java-user@lucene.apache.org'
Subject: How do you see if a tokenstream has tokens without consuming the
tokens?
We have a modified version of the Lucene StandardAnalyzer, which we use for
tokenizing music metadata such as artist names & song titles, so
typically only a few words. Tokenizing usually strips out
punctuation, which is correct, but if the input text consists of
only punctuation characters then we end up with nothing; for these
particular RARE cases I want to use a mapping filter.
So what I try to do is have my analyzer tokenize as normal, then if the
result is no tokens, retokenize with the mapping filter. I check it has
no tokens using incrementToken(), but then can't see how to
decrementToken(). How can I do this, or is there a more efficient way of
doing it? Note that of maybe 10,000,000 records only a few hundred will
have this problem, so I need a solution which doesn't impact performance
unreasonably.
NormalizeCharMap specialcharConvertMap = new NormalizeCharMap();
specialcharConvertMap.add("!", "Exclamation");
specialcharConvertMap.add("?", "QuestionMark");
...............

public TokenStream tokenStream(String fieldName, Reader reader) {
    CharFilter specialCharFilter =
        new MappingCharFilter(specialcharConvertMap, reader);
    StandardTokenizer tokenStream =
        new StandardTokenizer(LuceneVersion.LUCENE_VERSION, reader);
    try {
        if (tokenStream.incrementToken() == false) {
            tokenStream = new StandardTokenizer(
                LuceneVersion.LUCENE_VERSION, specialCharFilter);
        }
        else {
            // TODO **************** set tokenstream back as it was
            // before increment token
        }
    }
    catch (IOException ioe) {
    }
    TokenStream result = new LowerCaseFilter(tokenStream);
    return result;
}
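For the TODO above, one possibility (a sketch, not from the thread,
assuming Lucene's CachingTokenFilter, which buffers its input and replays
it after reset(); slurp is the same hypothetical read-it-all helper as
above) is to peek at a cached copy of the stream instead of the tokenizer
itself:

public TokenStream tokenStream(String fieldName, Reader reader) {
    try {
        // Read the short input once so it can be tokenized again if needed.
        String text = slurp(reader);
        CachingTokenFilter cached = new CachingTokenFilter(
                new StandardTokenizer(LuceneVersion.LUCENE_VERSION,
                                      new StringReader(text)));
        boolean hasTokens = cached.incrementToken();  // fills the cache
        cached.reset();  // rewind: the buffered tokens replay from the start
        TokenStream result = hasTokens
                ? cached
                : new StandardTokenizer(LuceneVersion.LUCENE_VERSION,
                      new MappingCharFilter(specialcharConvertMap,
                                            new StringReader(text)));
        return new LowerCaseFilter(result);
    }
    catch (IOException ioe) {
        throw new RuntimeException(ioe);
    }
}

Only the rare all-punctuation inputs pay for a second tokenization, so the
cost over 10,000,000 records should stay negligible.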
thanks for any help
Paul