Hi Piotr,

The behavior you mention is an intentional change from the behavior in Lucene 
4.9.0 and earlier, where tokens longer than maxTokenLength were silently ignored: 
see LUCENE-5897[1] and LUCENE-5400[2].

The new behavior is as follows: token matching rules are no longer allowed to 
match against input char sequences longer than maxTokenLength.  If a rule would 
match a sequence longer than maxTokenLength, but also matches at maxTokenLength 
chars or fewer, has the highest priority among the rules matching at that 
length, and no other rule matches more chars, then a token is emitted for that 
rule at the matching length.  Rule-matching iteration then simply continues 
from that point as normal.  If the same rule matches against the remainder of 
the sequence that the first rule would have matched had maxTokenLength been 
longer, another token at the matched length is emitted, and so on.  Note that 
this can effectively split the sequence at maxTokenLength intervals, as you 
noted.
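
For illustration, here's a minimal sketch of how you could see this splitting 
yourself (untested; the class name, the 25-letter input, and the limit of 10 
are made up, and it assumes the 5.x API where tokenizers take no constructor 
Reader and get one via setReader()):

import java.io.StringReader;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class MaxTokenLengthDemo {
  public static void main(String[] args) throws Exception {
    // 25 letters with maxTokenLength = 10: expect three tokens of
    // 10, 10 and 5 chars rather than the term being silently dropped.
    StandardTokenizer tokenizer = new StandardTokenizer();
    tokenizer.setMaxTokenLength(10);
    tokenizer.setReader(new StringReader("abcdefghijklmnopqrstuvwxy"));
    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
    tokenizer.reset();
    while (tokenizer.incrementToken()) {
      System.out.println(term.toString());
    }
    tokenizer.end();
    tokenizer.close();
  }
}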

I doubt ClassicAnalyzer has the same issue, since it isn’t built with the 
scanner buffer limitation technique used when constructing StandardTokenizer 
and UAX29URLEmailTokenizer.
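
If you'd rather check than take my word for it, the same kind of harness can be 
pointed at ClassicTokenizer (again an untested sketch with a made-up class name, 
input, and limit):

import java.io.StringReader;
import org.apache.lucene.analysis.standard.ClassicTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class ClassicMaxTokenLengthCheck {
  public static void main(String[] args) throws Exception {
    // Feed ClassicTokenizer a 25-letter term with maxTokenLength = 10 and
    // see whether it is split, dropped, or emitted whole.
    ClassicTokenizer tokenizer = new ClassicTokenizer();
    tokenizer.setMaxTokenLength(10);
    tokenizer.setReader(new StringReader("abcdefghijklmnopqrstuvwxy"));
    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
    tokenizer.reset();
    while (tokenizer.incrementToken()) {
      System.out.println(term.toString());
    }
    tokenizer.end();
    tokenizer.close();
  }
}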

Steve

[1] https://issues.apache.org/jira/browse/LUCENE-5897
[2] https://issues.apache.org/jira/browse/LUCENE-5400

> On Jul 20, 2015, at 4:21 AM, Piotr Idzikowski <piotridzikow...@gmail.com> 
> wrote:
> 
> Hello.
> Btw, I think ClassicAnalyzer has the same problem
> 
> Regards
> 
> On Fri, Jul 17, 2015 at 4:40 PM, Steve Rowe <sar...@gmail.com> wrote:
> 
>> Hi Piotr,
>> 
>> Thanks for reporting!
>> 
>> See https://issues.apache.org/jira/browse/LUCENE-6682
>> 
>> Steve
>> www.lucidworks.com
>> 
>>> On Jul 16, 2015, at 4:47 AM, Piotr Idzikowski <piotridzikow...@gmail.com>
>> wrote:
>>> 
>>> Hello.
>>> I am developing own analyzer based on StandardAnalyzer.
>>> I realized that tokenizer.setMaxTokenLength is called many times.
>>> 
>>> protected TokenStreamComponents createComponents(final String fieldName,
>>>     final Reader reader) {
>>>   final StandardTokenizer src = new StandardTokenizer(getVersion(), reader);
>>>   src.setMaxTokenLength(maxTokenLength);
>>>   TokenStream tok = new StandardFilter(getVersion(), src);
>>>   tok = new LowerCaseFilter(getVersion(), tok);
>>>   tok = new StopFilter(getVersion(), tok, stopwords);
>>>   return new TokenStreamComponents(src, tok) {
>>>     @Override
>>>     protected void setReader(final Reader reader) throws IOException {
>>>       src.setMaxTokenLength(StandardAnalyzer.this.maxTokenLength);
>>>       super.setReader(reader);
>>>     }
>>>   };
>>> }
>>> 
>>> Does it make sense if length stays the same? I see it finally calls this
>>> one (in StandardTokenizerImpl):
>>> public final void setBufferSize(int numChars) {
>>>   ZZ_BUFFERSIZE = numChars;
>>>   char[] newZzBuffer = new char[ZZ_BUFFERSIZE];
>>>   System.arraycopy(zzBuffer, 0, newZzBuffer, 0,
>>>       Math.min(zzBuffer.length, ZZ_BUFFERSIZE));
>>>   zzBuffer = newZzBuffer;
>>> }
>>> So it just copies old array content into the new one.
>>> 
>>> Regards
>>> Piotr Idzikowski
>> 


