Hi Paul,

You could add a rule to the StandardTokenizer JFlex grammar to handle this case, bypassing its other rules.
Another option is to create a char filter that substitutes PUNCT-EXCLAMATION for exclamation points, PUNCT-PERIOD for periods, etc., but only when the entire input consists exclusively of whitespace and punctuation. These symbols would then be left intact by StandardTokenizer. (A sketch of this approach follows the quoted message below.)

Steve

> -----Original Message-----
> From: Paul Taylor [mailto:paul_t...@fastmail.fm]
> Sent: Monday, October 17, 2011 8:13 AM
> To: 'java-user@lucene.apache.org'
> Subject: How do you see if a tokenstream has tokens without consuming the tokens?
>
> We have a modified version of the Lucene StandardAnalyzer, which we use
> for tokenizing music metadata such as artist names & song titles, so
> typically only a few words. Tokenizing usually strips out punctuation,
> which is correct; however, if the input text consists of only
> punctuation characters then we end up with nothing. For these
> particular RARE cases I want to use a mapping filter.
>
> So what I try to do is have my analyzer tokenize as normal, and then,
> if the result is no tokens, retokenize with the mapping filter. I check
> that it has no tokens using incrementToken(), but then I can't see how
> to decrementToken(). How can I do this, or is there a more efficient
> way of doing it? Note that of maybe 10,000,000 records only a few
> hundred will have this problem, so I need a solution that doesn't
> impact performance unreasonably.
>
> NormalizeCharMap specialcharConvertMap = new NormalizeCharMap();
> specialcharConvertMap.add("!", "Exclamation");
> specialcharConvertMap.add("?", "QuestionMark");
> ...............
>
> public TokenStream tokenStream(String fieldName, Reader reader) {
>     CharFilter specialCharFilter =
>         new MappingCharFilter(specialcharConvertMap, reader);
>
>     StandardTokenizer tokenStream =
>         new StandardTokenizer(LuceneVersion.LUCENE_VERSION, reader);
>     try
>     {
>         if (tokenStream.incrementToken() == false)
>         {
>             tokenStream = new StandardTokenizer(
>                     LuceneVersion.LUCENE_VERSION, specialCharFilter);
>         }
>         else
>         {
>             // TODO **************** set tokenstream back as it was
>             // before incrementToken()
>         }
>     }
>     catch (IOException ioe)
>     {
>     }
>     TokenStream result = new LowerCaseFilter(tokenStream);
>     return result;
> }
>
> thanks for any help
>
> Paul
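For reference, a minimal sketch of the char filter idea above, written against the Lucene 3.4 API that was current at the time. The analyzer class name, the helper methods, and the all-punctuation check are illustrative assumptions, not something from this thread; the replacement words follow Paul's specialcharConvertMap. Because music metadata values are short, the sketch reads the whole input up front to decide whether the mapping is needed.

    import java.io.IOException;
    import java.io.Reader;
    import java.io.StringReader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.MappingCharFilter;
    import org.apache.lucene.analysis.NormalizeCharMap;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.util.Version;

    // Illustrative analyzer: maps punctuation to words, but only when the
    // whole input would otherwise be discarded by StandardTokenizer.
    public class MusicMetadataAnalyzer extends Analyzer {

        private static final NormalizeCharMap PUNCT_MAP = new NormalizeCharMap();
        static {
            PUNCT_MAP.add("!", "Exclamation");
            PUNCT_MAP.add("?", "QuestionMark");
            // ... one entry per punctuation character you care about
        }

        @Override
        public TokenStream tokenStream(String fieldName, Reader reader) {
            // Inputs are short (artist names, song titles), so slurping the
            // text to decide which path to take is cheap.
            String text = readFully(reader);

            Reader source = new StringReader(text);
            if (isOnlyPunctuationAndWhitespace(text)) {
                // Rare case: rewrite the punctuation into words that
                // StandardTokenizer will keep.
                source = new MappingCharFilter(PUNCT_MAP, source);
            }

            TokenStream stream = new StandardTokenizer(Version.LUCENE_34, source);
            return new LowerCaseFilter(Version.LUCENE_34, stream);
        }

        // Approximate check for "nothing but whitespace and punctuation":
        // true if the text contains no letters or digits.
        private static boolean isOnlyPunctuationAndWhitespace(String text) {
            for (int i = 0; i < text.length(); i++) {
                if (Character.isLetterOrDigit(text.charAt(i))) {
                    return false;
                }
            }
            return true;
        }

        // Reads the entire Reader into a String (fine for short fields).
        private static String readFully(Reader reader) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[256];
            try {
                int n;
                while ((n = reader.read(buf)) != -1) {
                    sb.append(buf, 0, n);
                }
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
            return sb.toString();
        }
    }

Two notes on the sketch: MappingCharFilter replaces each match independently, so "!!!" comes out as a single concatenated token (ExclamationExclamationExclamation); if separate tokens are wanted, map to replacements padded with spaces. And on the narrower question in the subject line, one way to check whether a stream produced tokens without losing them is to wrap it in CachingTokenFilter, which buffers the tokens it reads so they can be replayed after reset().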