All,
I realize that one should normally consume all tokens from a stream, but I'd
like to wrap a client's Analyzer with LimitTokenCountAnalyzer using
consumeAllTokens=false. For the analyzers I've used so far, this has caused no
problems.
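For concreteness, here is roughly what the wrapping looks like (a minimal
sketch; the StandardAnalyzer and the limit of 10 are just placeholders for
whatever the client supplies):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.miscellaneous.LimitTokenCountAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.util.Version;

    public class LimitWrapSketch {
      public static void main(String[] args) {
        // standing in for whatever Analyzer the client supplies
        Analyzer client = new StandardAnalyzer(Version.LUCENE_45);
        // consumeAllTokens=false: stop pulling from the underlying
        // stream once 10 tokens have been produced
        Analyzer limited = new LimitTokenCountAnalyzer(client, 10, false);
      }
    }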
When I use MockTokenizer, however, I run into this assertion error: "end()
called before incrementToken()". The comment in MockTokenizer reads:
// some tokenizers, such as limiting tokenizers, call end() before
// incrementToken() returns false.
// these tests should disable this check (in general you should
// consume the entire stream)
Disabling these assertions gives me pause, as does departing from the
documented TokenStream workflow
(http://lucene.apache.org/core/4_5_1/core/index.html). I take the warnings to
mean that there are Analyzers and use cases that will fail unless the stream
is entirely consumed.
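To be explicit about the workaround I'm wary of, here is a sketch of the test
side (MockTokenizer comes from lucene-test-framework; the input string is just
a placeholder):

    import java.io.StringReader;
    import org.apache.lucene.analysis.MockTokenizer;

    public class DisableChecksSketch {
      public static void main(String[] args) {
        MockTokenizer tok = new MockTokenizer(
            new StringReader("one two three four"),
            MockTokenizer.WHITESPACE, false);
        // turns off MockTokenizer's consumed-stream assertions, per the
        // comment quoted above; this is exactly the step I'm unsure about
        tok.setEnableChecks(false);
      }
    }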
Is there a safe way to wrap a client Analyzer and read only the first x
tokens? Or should I let the client decide whether or not to consume the
entire stream?
Thank you!
Best,
Tim