Hi,

The junk is appended here:

    buffer.append(termAtt.buffer());
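To see why that call picks up junk, here is a minimal plain-Java sketch (no Lucene needed). It assumes the array returned by termAtt.buffer() is larger than the term it holds. TermBufferDemo, its methods, and the 14-char array are made up for illustration; 14 was picked only because it matches the first non-zero length logged in the original post.

```java
// Plain-Java sketch; no Lucene on the classpath. The oversized array is a
// hypothetical stand-in for what termAtt.buffer() returns.
final class TermBufferDemo {

    /** Mimics the problematic call: appends the entire array, padding included. */
    static String appendWholeArray(char[] backing) {
        return new StringBuilder().append(backing).toString();
    }

    /** Appends only the first validLength chars, the part that holds the term. */
    static String appendValidPart(char[] backing, int validLength) {
        return new StringBuilder().append(backing, 0, validLength).toString();
    }

    /** Builds a 14-char array holding "booh" followed by '\0' padding. */
    static char[] sampleBacking() {
        char[] backing = new char[14];
        "booh".getChars(0, 4, backing, 0);
        return backing;
    }

    public static void main(String[] args) {
        char[] backing = sampleBacking();
        // The whole array is appended, so the padding counts toward the length.
        System.out.println(appendWholeArray(backing).length());
        // Only the valid prefix is appended.
        System.out.println(appendValidPart(backing, 4));
    }
}
```

With the real attribute, termAtt.length() plays the role of validLength, so buffer.append(termAtt.buffer(), 0, termAtt.length()) would be the array-based way to append only the term.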
I assume you are on Lucene 3.1+, so use:

    buffer.append(termAtt);

termAtt implements CharSequence, so it can be appended to any StringBuilder. The code you are using appends the whole backing char array, which may contain characters beyond termAtt.length(). The same applies to the append in your else branch.

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Jithin [mailto:jithin1...@gmail.com]
> Sent: Friday, September 30, 2011 11:12 PM
> To: java-user@lucene.apache.org
> Subject: Writing a TokenConcatenateFilter - junk characters appearing on output.
>
> Hi,
> I am trying to write a TokenFilter which just concatenates all the tokens in
> the input TokenStream.
> The issue I am facing is that my filter outputs certain junk characters in
> addition to the concatenated string. I believe this is caused by StringBuilder.
>
> This is my incrementToken() function:
>
> public boolean incrementToken() throws IOException {
>     //if (!input.incrementToken()) {
>     //    return false;
>     //}
>     if (finished) {
>         logger.error("Finished");
>         return false;
>     }
>     logger.error("Starting");
>     StringBuilder buffer = new StringBuilder();
>     int length = 0;
>     while (input.incrementToken()) {
>         logger.error(Integer.toString(buffer.length()));
>         logger.error(buffer.toString());
>         if (0 == length) {
>             buffer.append(termAtt.buffer());
>             length += termAtt.length();
>         } else {
>             buffer.append(" ").append(termAtt.buffer());
>             length += termAtt.length() + 1;
>         }
>
>     }
>
>     logger.error("####### Final");
>     logger.error(Integer.toString(buffer.length()));
>     logger.error(Integer.toString(length));
>     logger.error(buffer.toString());
>
>     termAtt.setEmpty().append(buffer);
>     offsetAtt.setOffset(0, length);
>     finished = true;
>     return true;
> }
>
>
> *Output for input tokens booh and good:*
>
> SEVERE: Starting
> Sep 30, 2011 9:02:13 PM org.ctown.solr.analysis.CTConcatFilter incrementToken
> SEVERE: 0
> Sep 30, 2011 9:02:13 PM org.ctown.solr.analysis.CTConcatFilter incrementToken
> SEVERE:
> Sep 30, 2011 9:02:13 PM org.ctown.solr.analysis.CTConcatFilter incrementToken
> SEVERE: 14
> Sep 30, 2011 9:02:13 PM org.ctown.solr.analysis.CTConcatFilter incrementToken
> SEVERE: booh
> Sep 30, 2011 9:02:13 PM org.ctown.solr.analysis.CTConcatFilter incrementToken
> SEVERE: ####### Final
> Sep 30, 2011 9:02:13 PM org.ctown.solr.analysis.CTConcatFilter incrementToken
> SEVERE: 29
> Sep 30, 2011 9:02:13 PM org.ctown.solr.analysis.CTConcatFilter incrementToken
> SEVERE: 9
> Sep 30, 2011 9:02:13 PM org.ctown.solr.analysis.CTConcatFilter incrementToken
> SEVERE: booh good
> Sep 30, 2011 9:02:13 PM org.ctown.solr.analysis.CTConcatFilter incrementToken
> SEVERE: Finished
>
>
> And this is how it appears on the Solr analysis page
> (http://localhost:8983/solr/admin/analysis.jsp):
>
> org.ctown.solr.analysis.CTConcatFilterFactory {luceneMatchVersion=LUCENE_34}
> position 1
> *term text booh#0;#0;#0;#0;#0;#0;#0;#0;#0;#0; good#0;#0;#0;#0;#0;#0;#0;#0;#0;#0;*
> startOffset 0
> endOffset 9
>
> Kindly help me understand what I am doing wrong and how to fix this.
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Writing-a-TokenConcatenateFilter-junk-characters-appearing-on-output-tp3383684p3383684.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org