Hi Paul,

Since you have modified the StandardAnalyzer (I presume you mean StandardFilter), why not check term.text() and, if it is all punctuation, skip the analysis for that term? Something like this in your StandardFilter:
    public final boolean incrementToken() throws IOException {
        // advance the underlying stream first
        if (!input.incrementToken()) {
            return false;
        }
        CharTermAttribute ta = getAttribute(CharTermAttribute.class);
        if (isAllPunctuation(ta.buffer(), ta.length())) {
            return true; // pass the punctuation-only term through untouched
        } else {
            // ... normal processing here
        }
        return true;
    }

If the filters are made KeywordAttribute-aware (I have a bug open on this, LUCENE-3236, although I only asked for the Lowercase and Stop filters there), then it's even simpler: you can plug in your own filter that marks the term with a KeywordAttribute so downstream filters pass it through.

-sujit

On Mon, 2011-10-17 at 13:12 +0100, Paul Taylor wrote:
> We have a modified version of a Lucene StandardAnalyzer, which we use for
> tokenizing music metadata such as artist names & song titles, so
> typically only a few words. Tokenizing usually strips out punctuation,
> which is correct; however, if the input text consists of only
> punctuation characters then we end up with nothing. For these
> particular RARE cases I want to use a mapping filter.
>
> So what I try to do is have my analyzer tokenize as normal, then if the
> result is no tokens, retokenize with the mapping filter. I check that it
> has no tokens using incrementToken(), but then can't see how I can
> decrementToken(). How can I do this, or is there a more efficient way of
> doing it? Note that of maybe 10,000,000 records only a few hundred will
> have this problem, so I need a solution which doesn't impact performance
> unreasonably.
>
> NormalizeCharMap specialcharConvertMap = new NormalizeCharMap();
> specialcharConvertMap.add("!", "Exclamation");
> specialcharConvertMap.add("?", "QuestionMark");
> ...............
>
> public TokenStream tokenStream(String fieldName, Reader reader) {
>     CharFilter specialCharFilter =
>         new MappingCharFilter(specialcharConvertMap, reader);
>     StandardTokenizer tokenStream =
>         new StandardTokenizer(LuceneVersion.LUCENE_VERSION, reader);
>     try {
>         if (tokenStream.incrementToken() == false) {
>             tokenStream = new StandardTokenizer(LuceneVersion.LUCENE_VERSION,
>                                                 specialCharFilter);
>         } else {
>             // TODO **************** set tokenstream back as it was
>             // before incrementToken
>         }
>     } catch (IOException ioe) {
>     }
>     TokenStream result = new LowerCaseFilter(LuceneVersion.LUCENE_VERSION, tokenStream);
>     return result;
> }
>
> thanks for any help
>
> Paul
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
> ---------------------------------------------------------------------
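P.S. The isAllPunctuation check in my sketch above is not a Lucene method — you would write it yourself. A minimal plain-Java version (the name and signature are mine; it assumes you pass the term buffer together with the term length, since CharTermAttribute.buffer() can be longer than the term) might look like:

    // Hypothetical helper for the filter sketch: returns true when every
    // character in the first `length` chars of the buffer is punctuation
    // (or a symbol, so that terms like "+" also qualify).
    public class PunctuationCheck {
        public static boolean isAllPunctuation(char[] buffer, int length) {
            if (length == 0) {
                return false; // an empty term is not "all punctuation"
            }
            for (int i = 0; i < length; i++) {
                switch (Character.getType(buffer[i])) {
                    case Character.CONNECTOR_PUNCTUATION:
                    case Character.DASH_PUNCTUATION:
                    case Character.START_PUNCTUATION:
                    case Character.END_PUNCTUATION:
                    case Character.INITIAL_QUOTE_PUNCTUATION:
                    case Character.FINAL_QUOTE_PUNCTUATION:
                    case Character.OTHER_PUNCTUATION:
                    case Character.MATH_SYMBOL:
                        continue; // punctuation or symbol, keep scanning
                    default:
                        return false; // letter, digit, whitespace, etc.
                }
            }
            return true;
        }
    }

You could tighten or loosen the set of Character categories depending on what your data actually contains.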
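Also, independent of the Lucene plumbing, the fallback strategy you describe — tokenize normally, and only if that yields nothing re-run the input through the punctuation mapping — can be sketched with plain Java collections. Everything below (class and method names, the whitespace tokenizer stand-in) is illustrative, not Lucene API:

    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    // Sketch of the two-pass fallback: normal tokenization first, and only
    // for the rare all-punctuation inputs apply the special-character map.
    public class FallbackTokenize {
        static final Map<String, String> SPECIAL = new LinkedHashMap<>();
        static {
            SPECIAL.put("!", "Exclamation");
            SPECIAL.put("?", "QuestionMark");
        }

        // Stand-in for the normal analyzer chain: split on whitespace,
        // strip punctuation, lowercase (roughly what your chain does).
        static List<String> tokenize(String text) {
            List<String> tokens = new ArrayList<>();
            for (String piece : text.split("\\s+")) {
                String stripped = piece.replaceAll("\\p{Punct}+", "");
                if (!stripped.isEmpty()) {
                    tokens.add(stripped.toLowerCase());
                }
            }
            return tokens;
        }

        static List<String> analyze(String text) {
            List<String> tokens = tokenize(text);
            if (tokens.isEmpty()) {
                // Rare case: input was all punctuation, so map it to
                // words first, then tokenize the mapped text.
                String mapped = text;
                for (Map.Entry<String, String> e : SPECIAL.entrySet()) {
                    mapped = mapped.replace(e.getKey(), e.getValue());
                }
                tokens = tokenize(mapped);
            }
            return tokens;
        }
    }

Since the fallback only runs when the first pass produced no tokens, the common case (a few hundred bad records out of 10,000,000) pays nothing extra beyond the emptiness check, which matches your performance constraint.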