Hi Paul,

You could add a rule to the StandardTokenizer JFlex grammar to handle this case, bypassing its other rules.
Another option is to create a char filter that substitutes PUNCT-EXCLAMATION for exclamation points, PUNCT-PERIOD for periods, etc., but only when the entire input consists exclusively of whitespace and punctuation. These symbols would then be left intact by StandardTokenizer. (A sketch of this approach follows the quoted message below.)

Steve

> -----Original Message-----
> From: Paul Taylor [mailto:paul_t...@fastmail.fm]
> Sent: Monday, October 17, 2011 8:13 AM
> To: 'java-user@lucene.apache.org'
> Subject: How do you see if a tokenstream has tokens without consuming the tokens?
>
> We have a modified version of the Lucene StandardAnalyzer, which we use
> for tokenizing music metadata such as artist names & song titles, so
> typically only a few words. Tokenizing usually strips out punctuation,
> which is correct; however, if the input text consists of only
> punctuation characters then we end up with nothing. For these
> particular RARE cases I want to use a mapping filter.
>
> So what I try to do is have my analyzer tokenize as normal, and then,
> if the result is no tokens, retokenize with the mapping filter. I check
> that it has no tokens using incrementToken(), but then I can't see how
> to decrementToken(). How can I do this, or is there a more efficient
> way of doing it? Note that of maybe 10,000,000 records only a few
> hundred will have this problem, so I need a solution that doesn't
> impact performance unreasonably.
>
> NormalizeCharMap specialcharConvertMap = new NormalizeCharMap();
> specialcharConvertMap.add("!", "Exclamation");
> specialcharConvertMap.add("?", "QuestionMark");
> ...............
>
> public TokenStream tokenStream(String fieldName, Reader reader) {
>     CharFilter specialCharFilter =
>         new MappingCharFilter(specialcharConvertMap, reader);
>
>     StandardTokenizer tokenStream =
>         new StandardTokenizer(LuceneVersion.LUCENE_VERSION, reader);
>     try
>     {
>         if (tokenStream.incrementToken() == false)
>         {
>             tokenStream = new StandardTokenizer(
>                     LuceneVersion.LUCENE_VERSION, specialCharFilter);
>         }
>         else
>         {
>             // TODO **************** set tokenstream back as it was
>             // before incrementToken()
>         }
>     }
>     catch (IOException ioe)
>     {
>     }
>     TokenStream result = new LowerCaseFilter(tokenStream);
>     return result;
> }
>
> thanks for any help
>
> Paul
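For reference, a minimal sketch of the char filter idea above, written against the Lucene 3.4 API that was current at the time. The analyzer class name, the helper methods, and the all-punctuation check are illustrative assumptions, not something from this thread; the replacement words follow Paul's specialcharConvertMap. Because music metadata values are short, the sketch reads the whole input up front to decide whether the mapping is needed.

    import java.io.IOException;
    import java.io.Reader;
    import java.io.StringReader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.MappingCharFilter;
    import org.apache.lucene.analysis.NormalizeCharMap;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.util.Version;

    // Illustrative analyzer: maps punctuation to words, but only when the
    // whole input would otherwise be discarded by StandardTokenizer.
    public class MusicMetadataAnalyzer extends Analyzer {

        private static final NormalizeCharMap PUNCT_MAP = new NormalizeCharMap();
        static {
            PUNCT_MAP.add("!", "Exclamation");
            PUNCT_MAP.add("?", "QuestionMark");
            // ... one entry per punctuation character you care about
        }

        @Override
        public TokenStream tokenStream(String fieldName, Reader reader) {
            // Inputs are short (artist names, song titles), so slurping the
            // text to decide which path to take is cheap.
            String text = readFully(reader);

            Reader source = new StringReader(text);
            if (isOnlyPunctuationAndWhitespace(text)) {
                // Rare case: rewrite the punctuation into words that
                // StandardTokenizer will keep.
                source = new MappingCharFilter(PUNCT_MAP, source);
            }

            TokenStream stream = new StandardTokenizer(Version.LUCENE_34, source);
            return new LowerCaseFilter(Version.LUCENE_34, stream);
        }

        // Approximate check for "nothing but whitespace and punctuation":
        // true if the text contains no letters or digits.
        private static boolean isOnlyPunctuationAndWhitespace(String text) {
            for (int i = 0; i < text.length(); i++) {
                if (Character.isLetterOrDigit(text.charAt(i))) {
                    return false;
                }
            }
            return true;
        }

        // Reads the entire Reader into a String (fine for short fields).
        private static String readFully(Reader reader) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[256];
            try {
                int n;
                while ((n = reader.read(buf)) != -1) {
                    sb.append(buf, 0, n);
                }
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
            return sb.toString();
        }
    }

Two notes on the sketch: MappingCharFilter replaces each match independently, so "!!!" comes out as a single concatenated token (ExclamationExclamationExclamation); if separate tokens are wanted, map to replacements padded with spaces. And on the narrower question in the subject line, one way to check whether a stream produced tokens without losing them is to wrap it in CachingTokenFilter, which buffers the tokens it reads so they can be replayed after reset().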