We have a modified version of a Lucene StandardAnalyzer; we use it for tokenizing music metadata such as artist names & song titles, so typically only a few words. Tokenizing usually strips out punctuation, which is correct, but if the input text consists of only punctuation characters then we end up with no tokens at all. For these particular RARE cases I want to use a mapping filter.

So what I'm trying to do is have my analyzer tokenize as normal, then if the result is no tokens, retokenize with the mapping filter. I check whether it produced a token using incrementToken(), but then I can't see how to undo that call (there is no decrementToken()). How can I do this, or is there a more efficient way? Note that of maybe 10,000,000 records only a few hundred will have this problem, so I need a solution that doesn't impact performance unreasonably.

    NormalizeCharMap specialcharConvertMap = new NormalizeCharMap();
    specialcharConvertMap.add("!", "Exclamation");
    specialcharConvertMap.add("?", "QuestionMark");
    ...............

    public TokenStream tokenStream(String fieldName, Reader reader) {
        CharFilter specialCharFilter = new MappingCharFilter(specialcharConvertMap, reader);

        StandardTokenizer tokenStream = new StandardTokenizer(LuceneVersion.LUCENE_VERSION, reader);
        try {
            if (!tokenStream.incrementToken()) {
                // No tokens at all: retokenize through the mapping filter instead
                tokenStream = new StandardTokenizer(LuceneVersion.LUCENE_VERSION, specialCharFilter);
            } else {
                //TODO **************** set tokenstream back as it was before incrementToken()
            }
        } catch (IOException ioe) {
            // ignored
        }
        TokenStream result = new LowerCaseFilter(tokenStream);
        return result;
    }
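For what it's worth, one way I could imagine sidestepping the decrementToken() problem entirely: since a Reader can only be consumed once anyway, read the field text into a String up front and decide *before* tokenizing whether the punctuation-only fallback is needed, rather than consuming a token and trying to put it back. This is only a sketch of that decision logic using plain JDK classes (no Lucene dependency); `hasTokenizableContent` and `readFully` are hypothetical helper names, and the letter-or-digit test is an assumption about what StandardTokenizer would keep:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class FallbackCheck {

    /**
     * Returns true if the text contains at least one letter or digit,
     * i.e. the standard tokenizer would be expected to emit at least
     * one token. Punctuation-only input returns false.
     */
    static boolean hasTokenizableContent(String text) {
        for (int i = 0; i < text.length(); i++) {
            if (Character.isLetterOrDigit(text.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    /** Reads the whole Reader into a String so it can be inspected and re-read. */
    static String readFully(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String normal = readFully(new StringReader("The Beatles"));
        String punctOnly = readFully(new StringReader("!!!"));
        System.out.println(hasTokenizableContent(normal));     // true
        System.out.println(hasTokenizableContent(punctOnly));  // false
    }
}
```

In tokenStream() you could then wrap the buffered text in a fresh `new StringReader(text)`, routing it through the MappingCharFilter only when `hasTokenizableContent(text)` is false. The check is a single pass over a few words, so it should be cheap enough not to matter across 10,000,000 records.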

thanks for any help


Paul

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
