Re: How do you see if a tokenstream has tokens without consuming the tokens ?

Paul Taylor Wed, 19 Oct 2011 02:27:08 -0700

On 18/10/2011 15:25, Steven A Rowe wrote:

Hi Paul,


On 10/18/2011 at 4:57 AM, Paul Taylor wrote:

On 18/10/2011 06:19, Steven A Rowe wrote:

Another option is to create a char filter that substitutes
PUNCT-EXCLAMATION for exclamation points, PUNCT-PERIOD for periods,
etc.,

Yes that is how I first did it

No, I don't think you did.  When I say "char filter" I'm referring to
CharFilter <http://lucene.apache.org/java/3_4_0/api/core/org/apache/lucene/analysis/CharFilter.html>
- this is a different kind of thing from the token filter approach you
described taking previously.

If you look at the code you can see I do use a CharFilter:

    NormalizeCharMap specialcharConvertMap = new NormalizeCharMap();
    specialcharConvertMap.add("!", "Exclamation");
    specialcharConvertMap.add("?", "QuestionMark");
    ...............

    public TokenStream tokenStream(String fieldName, Reader reader) {
        CharFilter specialCharFilter = new MappingCharFilter(specialcharConvertMap, reader);
        // First try the plain input, without the punctuation mapping
        StandardTokenizer tokenStream = new StandardTokenizer(LuceneVersion.LUCENE_VERSION, reader);
        try
        {
            if (tokenStream.incrementToken() == false)
            {
                // No tokens at all: fall back to the char-filtered input
                tokenStream = new StandardTokenizer(LuceneVersion.LUCENE_VERSION, specialCharFilter);
            }
            else
            {
                // TODO **************** set tokenstream back as it was before increment token
            }
        }
        catch (IOException ioe)
        {
        }
        TokenStream result = new LowercaseFilter(tokenStream);
        return result;
    }
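
One way the TODO above might be approached is to avoid un-consuming anything and instead buffer the input once, so that the probe and the real tokenizer each get their own reader. The sketch below uses only the JDK; ReplayableInput, drain and freshReader are hypothetical names, not Lucene classes, and it assumes the field values are small enough to hold in memory.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical helper (not part of Lucene): buffers a Reader so the same
// characters can be read more than once.
final class ReplayableInput {
    private final String text;

    private ReplayableInput(String text) {
        this.text = text;
    }

    // Drains the reader once and keeps its characters in memory.
    static ReplayableInput drain(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return new ReplayableInput(sb.toString());
    }

    // Each call returns an independent reader positioned at the start,
    // so consuming one does not affect the others.
    Reader freshReader() {
        return new StringReader(text);
    }

    String text() {
        return text;
    }
}
```

With something like this, the probing tokenizer could be built on one freshReader() and thrown away, and the tokenizer that is actually returned could be built on another, with or without the MappingCharFilter wrapped around it.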


If you go with a CharFilter, you can give it access to the entire input at 
once, and use a regular expression (or something like it) to assess the input 
and then behave accordingly.

Steve

Well, this is the problem: you can't use a regular expression, and even if you did, wouldn't that really slow things down, seeing as 99% don't need the transformation?
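
On the cost concern, a plain character scan may be cheaper than a regular expression, because it can stop at the first letter or digit it sees, which for ordinary input is usually the first character. The sketch below is only an approximation: it assumes that "contains no letter or digit" is close enough to "StandardTokenizer would emit no tokens", which has not been checked against the tokenizer's grammar. TokenProbe and hasLetterOrDigit are hypothetical names.

```java
// Hypothetical helper (not part of Lucene): a cheap pre-check on buffered input.
final class TokenProbe {

    // Returns true as soon as any letter or digit is found; returns false
    // for empty input or input made up only of punctuation and whitespace.
    static boolean hasLetterOrDigit(String text) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);      // handles supplementary characters
            if (Character.isLetterOrDigit(cp)) {
                return true;                   // early exit for ordinary text
            }
            i += Character.charCount(cp);
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasLetterOrDigit("hello!"));
        System.out.println(hasLetterOrDigit("!!! ???"));
    }
}
```

Only when the check returns false would the input need to go through the punctuation mapping; everything else could be passed to the tokenizer unchanged.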


Paul

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
