The problem here is that the TokenStream is instantiated on the same thread 
from two different code paths and consumed later. When you add fields, the 
indexer fetches a new (reused) TokenStream for each field, one after the 
other, and consumes each one directly after getting it; it does not 
interleave them. In your case, the second field is instantiated with a 
TokenStream that is already initialized. Unfortunately, as soon as you ask 
the analyzer for another TokenStream, the already opened one (the second 
field's) becomes invalid.

Don't use new Field(name, TokenStream) with TokenStreams obtained from 
Analyzers, because they are only "valid" for a very short time. If you really 
need to do this, use a second Analyzer instance. If you add fields with a 
String value instead, the TokenStream is created on the fly and is consumed 
by the DocumentsWriter directly after it is obtained.
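
To make the reuse contract concrete, here is a minimal plain-Java sketch. 
The classes below (ReusingAnalyzer, SharedTokenStream) are made up for 
illustration and are NOT Lucene API, but they mimic the behaviour described 
above: the analyzer hands out one shared stream instance, so asking for a 
second stream silently re-initializes (and thereby invalidates) the handle 
you got from the first call.

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for an Analyzer that reuses a single stream instance.
class ReusingAnalyzer {
    private final SharedTokenStream reused = new SharedTokenStream();

    // Each call re-initializes and returns the SAME object, invalidating
    // any handle obtained from an earlier call.
    SharedTokenStream tokenStream(String text) {
        reused.reset(List.of(text.split("\\s+")));
        return reused;
    }
}

// Hypothetical stand-in for a TokenStream.
class SharedTokenStream {
    private Iterator<String> tokens;
    private String current;

    void reset(List<String> newTokens) { tokens = newTokens.iterator(); }

    boolean incrementToken() {
        if (tokens != null && tokens.hasNext()) {
            current = tokens.next();
            return true;
        }
        return false;
    }

    String term() { return current; }
}

public class ReuseDemo {
    public static void main(String[] args) {
        ReusingAnalyzer a = new ReusingAnalyzer();

        SharedTokenStream first = a.tokenStream("aaa bbb ccc");
        // Asking for a second stream re-initializes the shared instance:
        SharedTokenStream second = a.tokenStream("xxx zzz");

        // 'first' and 'second' are the same object, so the handle from the
        // first call now yields the second call's tokens -- the symptom
        // described in the question.
        StringBuilder sb = new StringBuilder();
        while (first.incrementToken()) sb.append(first.term()).append('|');
        System.out.println(sb);              // prints "xxx|zzz|", not "aaa|bbb|ccc|"
        System.out.println(first == second); // prints "true"
    }
}
```

This is why the safe pattern is to add fields with String values and let the 
indexer pull and consume each stream immediately.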

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


> -----Original Message-----
> From: Konstantyn Smirnov [mailto:inject...@yahoo.com]
> Sent: Wednesday, February 27, 2013 6:25 PM
> To: java-user@lucene.apache.org
> Subject: Confusion with Analyzer.tokenStream() re-use in 4.1
> 
> Dear all,
> 
> I'm using the following test-code:
> 
> Document doc = new Document()
> Analyzer a = new SimpleAnalyzer( Version.LUCENE_41 )
> 
> TokenStream inputTS = a.tokenStream( 'name1', new StringReader( 'aaa bbb ccc' ) )
> Field f = new TextField( 'name1', inputTS )
> doc.add f
> 
> TokenStream ts = doc.getField( 'name1' ).tokenStreamValue()
> ts.reset()
> 
> String sb = ''
> while( ts.incrementToken() ) sb += ts.getAttribute( CharTermAttribute ) + '|'
> assert 'aaa|bbb|ccc|' == sb
> 
> inputTS = a.tokenStream( 'name2', new StringReader( 'xxx zzz' ) )
> f = new TextField( 'name2', inputTS )
> doc.add f
> 
> ts = doc.getField( 'name2' ).tokenStreamValue()
> ts.reset()
> 
> sb = ''
> while( ts.incrementToken() ) sb += ts.getAttribute( CharTermAttribute ) + '|'
> assert 'xxx|zzz|' == sb // << FAILS! -> sb == '' and ts.incrementToken() == false
> 
> The first added field lets me read its tokenStreamValue() tokens; all
> subsequent calls return nothing, unless I re-instantiate the analyzer.
> 
> Another strange thing is that just before adding a new field to the
> document, the tokenStream is filled..
> 
> What am I doing wrong?
> 
> TIA
> 
> 
> 
> 
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Confusion-with-Analyzer-
> tokenStream-re-use-in-4-1-tp4043427.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org

