Please post the code of the TokenStream(s) behind this analyzer. The bug
is there (and it is a bug, if it is not working correctly). The Lucene
internal analyzers don't have this problem, because the TokenStreams behind
them are correctly implemented and tested. In most cases such problems
appear when the underlying TokenStream does not correctly implement token
caching (with captureState or cloneAttributes) or fails to implement
reset() correctly.
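
For illustration, here is a minimal sketch of the correct caching pattern.
The class name and the duplicate-token behavior are invented for the
example; the point is that the token is saved with captureState() and
replayed with restoreState() instead of copying attributes by hand:

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.AttributeSource;

// Hypothetical filter that emits every token twice.
public final class RepeatFilter extends TokenFilter {
  private final PositionIncrementAttribute posIncAtt =
      addAttribute(PositionIncrementAttribute.class);
  private AttributeSource.State saved; // cached copy of all attributes

  public RepeatFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (saved != null) {
      restoreState(saved);             // replay the cached token
      posIncAtt.setPositionIncrement(0); // duplicate sits on the same position
      saved = null;
      return true;
    }
    if (!input.incrementToken()) {
      return false;
    }
    saved = captureState(); // cache the token for the second emission
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    saved = null; // drop cached state, otherwise it leaks into the next reuse
  }
}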

The problem you have is: in Lucene 4.x a TokenStream is *required* to
support reuse, so reset() must be implemented and must restore a consistent
state.
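
To show why this matters, here is another hypothetical filter (again
invented for illustration) that keeps per-stream state. Without the reset()
override, a reused stream behaves exactly like the problem you describe:
the first text produces output, the following ones produce nothing:

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// Hypothetical filter that passes through at most maxTokens tokens.
public final class MaxTokensFilter extends TokenFilter {
  private final int maxTokens;
  private int seen; // per-stream state, must be cleared on reset()

  public MaxTokensFilter(TokenStream input, int maxTokens) {
    super(input);
    this.maxTokens = maxTokens;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (seen >= maxTokens || !input.incrementToken()) {
      return false;
    }
    seen++;
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset(); // resets the rest of the chain
    seen = 0;      // forgetting this line makes every reused stream appear empty
  }
}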

If you want to test your custom TokenStreams and Analyzers, you should use
BaseTokenStreamTestCase from the Lucene test-framework. It will show you
any misuse of the APIs inside your TokenStream implementations (like
caching tokens incorrectly instead of using captureState/restoreState, and
so on).
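
A minimal sketch of such a test, assuming the CustomAnalyzer from your mail
below; the expected tokens are taken from your quoted 3.6.1 output:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.BaseTokenStreamTestCase;

public class CustomAnalyzerTest extends BaseTokenStreamTestCase {

  public void testReuse() throws Exception {
    Analyzer analyzer = new CustomAnalyzer(IndexLocale.ENGLISH, false);
    // Consecutive calls reuse the cached components, so this also
    // exercises reset():
    assertAnalyzesTo(analyzer, "demo", new String[] { "demo" });
    assertAnalyzesTo(analyzer, "API", new String[] { "API" });
    // Feeds random text through the analyzer and checks the reuse,
    // offset, end() and close() contracts:
    checkRandomData(random(), analyzer, 1000);
  }
}

checkRandomData() is especially good at finding broken reset()
implementations, because it consumes the same analyzer over and over.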

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: zzT [mailto:zis....@gmail.com]
> Sent: Thursday, July 11, 2013 9:31 AM
> To: java-user@lucene.apache.org
> Subject: Lucene 4.0 tokenstream logic
> 
> Hi all,
> 
> I'm migrating from Lucene 3.6.1 to 4.3.1 and there seems to be a major
> change in how analyzers work.
> Given the code example below (which is almost copied from
> http://lucene.apache.org/core/4_3_1/core/index.html)
> 
> @Test
> public void testAnalysis() throws IOException {
>     final String[] texts = {"demo", "TokenStream", "API"};
>     CustomAnalyzer analyzer = new CustomAnalyzer(IndexLocale.ENGLISH, false);
> 
>     for (String text : texts) {
>         TokenStream stream = analyzer.tokenStream("field", new StringReader(text));
>         CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
> 
>         try {
>             stream.reset();
>             while (stream.incrementToken()) {
>                 System.out.println("Token : " + termAtt.toString());
>             }
>             stream.end();
>         } finally {
>             stream.close();
>         }
>     }
> }
> 
> The output is the following
> in 3.6.1 :
> Token : demo
> Token : Tokenstream
> Token : API
> 
> while in
> 4.3.1 :
> Token : demo
> 
> This is happening because of the ReuseStrategy that is now embedded in
> Analyzer: tokenStream() caches the components built for the first call
> ("demo") and reuses them afterwards.
> 
> CustomAnalyzer is a custom analyzer :) and its implementation is irrelevant
> to the question (apart from the fact that in 3.6.1 it overrides
> tokenStream() while in 4.3.1 it overrides createComponents()). I'm pretty
> sure the same happens with Lucene's own analyzers too.
> 
> The question is: do I need to change something in my logic to make it work
> as in 3.6.1? The only way I've found to get the same output is to create a
> new CustomAnalyzer before each tokenStream() call.
> 
> 
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
