In addition, in your first field you are using a StringReader to feed in the data, which can only be consumed once. This has nothing to do with TokenStream reuse.
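The single-consumption point can be seen with a plain java.io.StringReader, independent of Lucene: once a consumer has drained it, further reads return -1 unless something explicitly resets it (which a tokenizer will not do for you). A minimal illustration:

```java
import java.io.IOException;
import java.io.StringReader;

public class StringReaderOnce {
    public static void main(String[] args) throws IOException {
        StringReader r = new StringReader("aaa bbb ccc");

        // First pass drains the reader to end-of-stream.
        StringBuilder firstPass = new StringBuilder();
        int c;
        while ((c = r.read()) != -1) {
            firstPass.append((char) c);
        }
        System.out.println("first pass:  " + firstPass); // aaa bbb ccc

        // Further reads keep returning -1: the data is gone for any
        // consumer that does not reset() the reader itself.
        System.out.println("second read: " + r.read()); // -1
    }
}
```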
-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Uwe Schindler [mailto:u...@thetaphi.de]
> Sent: Wednesday, February 27, 2013 8:03 PM
> To: 'java-user@lucene.apache.org'
> Subject: RE: Confusion with Analyzer.tokenStream() re-use in 4.1
>
> The problem here is that the TokenStream is instantiated in the same thread
> from two different code paths and consumed later. If you add fields, the
> indexer fetches the reused TokenStreams one after another and consumes
> each one directly after getting it. It will not interleave this. In your
> case, the second field is instantiated using a TokenStream which is already
> initialized. Unfortunately, if you ask the analyzer for another TokenStream
> later, the already opened one becomes invalid (the second field).
>
> Don't use new Field(name, TokenStream) with TokenStreams obtained from
> Analyzers, because they are only "valid" for a very short time. If you need
> to do this, use a second Analyzer instance. If you add fields with a String
> value, the TokenStream is created on the fly and is consumed by the
> DocumentsWriter directly after getting it.
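The reuse behaviour described above can be sketched with a toy stand-in for the analyzer. This is NOT the real Lucene API — class and method names here are invented for illustration — but it models the contract: each call to tokenStream() re-initializes and returns the SAME cached instance, so a stream obtained earlier silently starts reading the new input.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;

/** Toy model of an analyzer that reuses one tokenizer per thread.
 *  Illustrative only; not the real Lucene API. */
class ReusingAnalyzer {
    static class TokenStream {
        private Iterator<String> tokens = Collections.emptyIterator();
        void setInput(String text) {
            tokens = Arrays.asList(text.split("\\s+")).iterator();
        }
        String next() {
            return tokens.hasNext() ? tokens.next() : null;
        }
    }

    private final TokenStream cached = new TokenStream();

    /** Always returns the SAME cached instance, re-pointed at the new
     *  input -- exactly why a previously returned stream goes invalid. */
    TokenStream tokenStream(String text) {
        cached.setInput(text);
        return cached;
    }

    public static void main(String[] args) {
        ReusingAnalyzer a = new ReusingAnalyzer();

        TokenStream first = a.tokenStream("aaa bbb ccc");
        TokenStream second = a.tokenStream("xxx zzz");

        // Both variables point at the same object; requesting the second
        // stream redirected the first one to the new input.
        System.out.println(first == second); // true
        System.out.println(first.next());    // xxx, not aaa

        // Safe pattern: consume each stream fully before asking for another.
        TokenStream ts = a.tokenStream("aaa bbb ccc");
        StringBuilder sb = new StringBuilder();
        for (String t = ts.next(); t != null; t = ts.next()) {
            sb.append(t).append('|');
        }
        System.out.println(sb); // aaa|bbb|ccc|
    }
}
```

This is why the advice above says to either consume the stream immediately or use a second Analyzer instance: two live streams from one reusing analyzer are never valid at the same time.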
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
> > -----Original Message-----
> > From: Konstantyn Smirnov [mailto:inject...@yahoo.com]
> > Sent: Wednesday, February 27, 2013 6:25 PM
> > To: java-user@lucene.apache.org
> > Subject: Confusion with Analyzer.tokenStream() re-use in 4.1
> >
> > Dear all,
> >
> > I'm using the following test code:
> >
> > Document doc = new Document()
> > Analyzer a = new SimpleAnalyzer( Version.LUCENE_41 )
> >
> > TokenStream inputTS = a.tokenStream( 'name1', new StringReader( 'aaa bbb ccc' ) )
> > Field f = new TextField( 'name1', inputTS )
> > doc.add f
> >
> > TokenStream ts = doc.getField( 'name1' ).tokenStreamValue()
> > ts.reset()
> >
> > String sb = ''
> > while( ts.incrementToken() ) sb += ts.getAttribute( CharTermAttribute ) + '|'
> > assert 'aaa|bbb|ccc|' == sb
> >
> > inputTS = a.tokenStream( 'name2', new StringReader( 'xxx zzz' ) )
> > f = new TextField( 'name2', inputTS )
> > doc.add f
> >
> > ts = doc.getField( 'name2' ).tokenStreamValue()
> > ts.reset()
> >
> > sb = ''
> > while( ts.incrementToken() ) sb += ts.getAttribute( CharTermAttribute ) + '|'
> > assert 'xxx|zzz|' == sb // << FAILS! -> sb == '' and ts.incrementToken() == false
> >
> > The first added field lets me read its tokenStreamValue() tokens; all
> > subsequent calls return nothing, unless I re-instantiate the analyzer.
> >
> > Another strange thing is that just before adding a new field to the
> > document, the tokenStream is filled.
> >
> > What am I doing wrong?
> >
> > TIA
> >
> > --
> > View this message in context:
> > http://lucene.472066.n3.nabble.com/Confusion-with-Analyzer-tokenStream-re-use-in-4-1-tp4043427.html
> > Sent from the Lucene - Java Users mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org