That was an easy fix. Everything works as expected now. Thanks again.

-----Original Message-----
From: Uwe Schindler [mailto:u...@thetaphi.de]
Sent: Thursday, December 05, 2013 1:46 PM
To: java-user@lucene.apache.org
Subject: RE: Analyzers aren't reusable?? (lucene 4.2.1)

The problem is the CharFilter, which cannot be reused. To implement the Analyzer correctly, wrap the incoming Reader in the protected initReader() method: http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/analysis/Analyzer.html#initReader(java.lang.String, java.io.Reader)

In createComponents(), only take the Reader from the parameter and create the Tokenizer and TokenFilters (which can be reused). initReader() ensures that every call to tokenStream() creates a new wrapped Reader and passes it to the reused Tokenizer.

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Scott Smith [mailto:ssm...@mainstreamdata.com]
> Sent: Thursday, December 05, 2013 9:36 PM
> To: java-user@lucene.apache.org
> Subject: Analyzers aren't reusable?? (lucene 4.2.1)
>
> I wrote the following to demonstrate what for me was surprising
> behavior (this is Lucene 4.2.1). If you want to run this yourself,
> you should be able to, since there are no references to anything other
> than standard Lucene and Java libraries. Basically, this is an
> analyzer that makes everything lowercase and strips all of the HTML tags.
>
> public final class DemoAnalyzer extends StopwordAnalyzerBase {
>     public DemoAnalyzer()
>     {
>         super(Version.LUCENE_42);
>     }
>
>     @Override
>     protected TokenStreamComponents createComponents(String fieldName, Reader reader)
>     {
>         final Tokenizer source = new StandardTokenizer(Version.LUCENE_42,
>                 new HTMLStripCharFilter(reader));
>         TokenStream result = new LowerCaseFilter(Version.LUCENE_42, source);
>         return new TokenStreamComponents(source, result);
>     }
>
>     // this is just a debug routine to display some results.
>     public static String getTokenStream(String a_zText, Analyzer a_zAnalyzer) throws IOException
>     {
>         TokenStream stream;
>         CharTermAttribute attr;
>         stream = a_zAnalyzer.tokenStream(null, new StringReader(a_zText));
>         stream.reset();
>         StringBuffer sb = new StringBuffer();
>         sb.append(a_zAnalyzer.toString());
>         sb.append("::");
>         while(stream.incrementToken())
>         {
>             attr = stream.getAttribute(CharTermAttribute.class);
>             if (sb.length() > 0)
>             {
>                 sb.append(' ');
>             }
>             sb.append(attr.toString());
>         }
>
>         return "original String: " + a_zText + "\n" + sb.toString();
>     }
>
>
>     public static void main(String[] args) throws IOException
>     {
>         String text = "<p>This is a <b>TEST</b> of the demo analyzer</p>";
>         Analyzer a = new DemoAnalyzer();
>
>         System.out.println(getTokenStream(text, a));
>
>         System.out.println(getTokenStream(text, a));
>
>         System.out.println(getTokenStream(text, new DemoAnalyzer()));
>     }
> }
>
> When I run this, I get the following output:
>
> original String: <p>This is a <b>TEST</b> of the demo analyzer</p>
> com.somedomain.DemoAnalyzer@5d3f79f7:: this is a test of the demo analyzer
>
> original String: <p>This is a <b>TEST</b> of the demo analyzer</p>
> com.somedomain.DemoAnalyzer@5d3f79f7:: p this is a b test b of the demo analyzer p
>
> original String: <p>This is a <b>TEST</b> of the demo analyzer</p>
> com.somedomain.DemoAnalyzer@138532dc:: this is a test of the demo analyzer
>
> The critical line is the second of each of the 3 pairs. Note the line
> in case 2 (of 3). Rather than stripping the entire HTML tag, it's just
> stripping the "<" and "/>". Is this expected behavior? I thought
> analyzers were thread-safe and reusable. Am I wrong on that point?
> I would expect the output of all three to be the same.
>
> Can someone explain to me what's going on? What am I missing?
>
> Scott

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
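
For anyone reading this thread later, here is a rough sketch of why the second call in Scott's test behaves differently, and why moving the wrapping into initReader() fixes it. It is a toy model that uses only the JDK, not Lucene's actual classes: ToyAnalyzer, ToyTokenizer, stripTags() and analyze() are all hypothetical stand-ins, and the regex-based tag stripping and tokenizing only approximate what HTMLStripCharFilter and StandardTokenizer do. The one thing it is meant to mirror is the reuse pattern Uwe describes: components are created once, and on later calls only the Reader returned by initReader() is handed to the cached tokenizer.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Toy model of analyzer component reuse. NOT Lucene code; all names are hypothetical.
class AnalyzerReuseSketch {

    // Reads a Reader to the end, converting checked I/O errors to unchecked ones.
    static String readAll(Reader r) {
        try {
            StringBuilder sb = new StringBuilder();
            for (int c = r.read(); c != -1; c = r.read()) sb.append((char) c);
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Stand-in for a CharFilter: returns a new Reader with <...> tags removed.
    static Reader stripTags(Reader in) {
        return new StringReader(readAll(in).replaceAll("<[^>]*>", ""));
    }

    // Stand-in for a reusable Tokenizer: its input Reader can be swapped.
    static final class ToyTokenizer {
        private Reader input;
        ToyTokenizer(Reader input) { this.input = input; }
        void setReader(Reader input) { this.input = input; }
        // Splits on non-alphanumerics and lowercases, joined by single spaces.
        String tokens() {
            return readAll(input).replaceAll("[^A-Za-z0-9]+", " ").trim().toLowerCase();
        }
    }

    // Stand-in for Analyzer: components are built once, then only the Reader changes.
    abstract static class ToyAnalyzer {
        private ToyTokenizer cached;

        protected Reader initReader(Reader reader) { return reader; }
        protected abstract ToyTokenizer createComponents(Reader reader);

        final String analyze(String text) {
            Reader r = initReader(new StringReader(text)); // runs on EVERY call
            if (cached == null) {
                cached = createComponents(r);              // runs on the FIRST call only
            } else {
                cached.setReader(r);                       // reuse: just swap the Reader
            }
            return cached.tokens();
        }
    }

    // Wraps the Reader in createComponents(), so the wrapping happens only once.
    static final class BrokenAnalyzer extends ToyAnalyzer {
        @Override protected ToyTokenizer createComponents(Reader reader) {
            return new ToyTokenizer(stripTags(reader));
        }
    }

    // Wraps the Reader in initReader(), so every call gets a freshly wrapped Reader.
    static final class FixedAnalyzer extends ToyAnalyzer {
        @Override protected Reader initReader(Reader reader) { return stripTags(reader); }
        @Override protected ToyTokenizer createComponents(Reader reader) {
            return new ToyTokenizer(reader);
        }
    }

    public static void main(String[] args) {
        String text = "<p>This is a <b>TEST</b> of the demo analyzer</p>";
        ToyAnalyzer broken = new BrokenAnalyzer();
        System.out.println("broken, 1st call: " + broken.analyze(text));
        System.out.println("broken, 2nd call: " + broken.analyze(text));
        ToyAnalyzer fixed = new FixedAnalyzer();
        System.out.println("fixed,  1st call: " + fixed.analyze(text));
        System.out.println("fixed,  2nd call: " + fixed.analyze(text));
    }
}
```

In this model the broken analyzer's second call never strips the markup, so the tokenizer sees the raw tags and emits "p" and "b" as tokens after discarding the punctuation, which matches the "p this is a b test b ..." line in Scott's output. In the real DemoAnalyzer, the corresponding change per Uwe's reply would be to override initReader() so that it returns the Reader wrapped in HTMLStripCharFilter, and to pass the plain Reader parameter to StandardTokenizer in createComponents().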