We have a modified version of a Lucene StandardAnalyzer; we use it for tokenizing music metadata such as artist names & song titles, so typically only a few words. Tokenizing usually strips out punctuation, which is correct, but if the input text consists of only punctuation characters then we end up with no tokens at all. For these particular RARE cases I want to use a mapping filter.

So what I'm trying to do is have my analyzer tokenize as normal, then if the result is no tokens, retokenize with the mapping filter. I check whether it produced a token using incrementToken(), but then I can't see how to undo that call (there is no decrementToken()). How can I do this, or is there a more efficient way? Note that of maybe 10,000,000 records only a few hundred will have this problem, so I need a solution that doesn't impact performance unreasonably.

    NormalizeCharMap specialcharConvertMap = new NormalizeCharMap();
    specialcharConvertMap.add("!", "Exclamation");
    specialcharConvertMap.add("?", "QuestionMark");
    ...............

    public TokenStream tokenStream(String fieldName, Reader reader) {
        CharFilter specialCharFilter = new MappingCharFilter(specialcharConvertMap, reader);

        StandardTokenizer tokenStream = new StandardTokenizer(LuceneVersion.LUCENE_VERSION, reader);
        try {
            if (!tokenStream.incrementToken()) {
                // No tokens at all: retokenize through the mapping filter instead
                tokenStream = new StandardTokenizer(LuceneVersion.LUCENE_VERSION, specialCharFilter);
            } else {
                //TODO **************** set tokenstream back as it was before incrementToken()
            }
        } catch (IOException ioe) {
            // ignored
        }
        TokenStream result = new LowerCaseFilter(tokenStream);
        return result;
    }
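For what it's worth, one way I could imagine sidestepping the decrementToken() problem entirely: since a Reader can only be consumed once anyway, read the field text into a String up front and decide *before* tokenizing whether the punctuation-only fallback is needed, rather than consuming a token and trying to put it back. This is only a sketch of that decision logic using plain JDK classes (no Lucene dependency); `hasTokenizableContent` and `readFully` are hypothetical helper names, and the letter-or-digit test is an assumption about what StandardTokenizer would keep:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class FallbackCheck {

    /**
     * Returns true if the text contains at least one letter or digit,
     * i.e. the standard tokenizer would be expected to emit at least
     * one token. Punctuation-only input returns false.
     */
    static boolean hasTokenizableContent(String text) {
        for (int i = 0; i < text.length(); i++) {
            if (Character.isLetterOrDigit(text.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    /** Reads the whole Reader into a String so it can be inspected and re-read. */
    static String readFully(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String normal = readFully(new StringReader("The Beatles"));
        String punctOnly = readFully(new StringReader("!!!"));
        System.out.println(hasTokenizableContent(normal));     // true
        System.out.println(hasTokenizableContent(punctOnly));  // false
    }
}
```

In tokenStream() you could then wrap the buffered text in a fresh `new StringReader(text)`, routing it through the MappingCharFilter only when `hasTokenizableContent(text)` is false. The check is a single pass over a few words, so it should be cheap enough not to matter across 10,000,000 records.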

thanks for any help


Paul

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
