RE: Writing a TokenConcatenateFilter - junk characters appearing on output.

Uwe Schindler Fri, 30 Sep 2011 15:46:11 -0700

Hi,

The junk is appended here: buffer.append(termAtt.buffer());


I assume you are on Lucene 3.1+, so use buffer.append(termAtt) instead.
termAtt implements CharSequence, so it can be appended to any
StringBuilder. The code you are using appends the whole backing char
array, which may contain unused characters beyond termAtt.length().
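
To illustrate the difference without pulling in Lucene, here is a minimal
sketch in plain Java. FakeTerm is a hypothetical stand-in for the term
attribute (a backing char array plus a valid length), and the capacity of
14 is an assumption chosen only so the numbers line up with the 14 and 29
in the log output quoted below. It is not Lucene's actual implementation.

```java
public class TermBufferAppendDemo {

    /** Hypothetical stand-in for a term attribute: backing array plus valid length. */
    static final class FakeTerm {
        final char[] buffer;
        final int length;

        FakeTerm(String text, int capacity) {
            buffer = new char[capacity];                 // unused slots stay '\u0000'
            text.getChars(0, text.length(), buffer, 0);  // copy the term into the front
            length = text.length();
        }
    }

    /** Mirrors the problematic call: append(char[]) copies the entire array. */
    static String joinWholeArray(FakeTerm... terms) {
        StringBuilder sb = new StringBuilder();
        for (FakeTerm t : terms) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(t.buffer);                         // includes trailing '\u0000' chars
        }
        return sb.toString();
    }

    /** Appends only the first `length` chars of each backing array. */
    static String joinValidChars(FakeTerm... terms) {
        StringBuilder sb = new StringBuilder();
        for (FakeTerm t : terms) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(t.buffer, 0, t.length);            // stops at the valid length
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        FakeTerm booh = new FakeTerm("booh", 14);
        FakeTerm good = new FakeTerm("good", 14);
        System.out.println(joinWholeArray(booh, good).length()); // 29
        System.out.println(joinValidChars(booh, good).length()); // 9
    }
}
```

In the real filter the equivalent of the second variant would be either
buffer.append(termAtt) or buffer.append(termAtt.buffer(), 0,
termAtt.length()).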

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Jithin [mailto:jithin1...@gmail.com]
> Sent: Friday, September 30, 2011 11:12 PM
> To: java-user@lucene.apache.org
> Subject: Writing a TokenConcatenateFilter - junk characters appearing on
> output.
> 
> Hi,
> I am trying to write a TokenFilter which just concatenates all the tokens in
> the input TokenStream.
> The issue I am facing is that my filter outputs certain junk characters in
> addition to the concatenated string. I believe this is caused by
> StringBuilder.
> 
> This is my incrementToken() function
> 
> public boolean incrementToken() throws IOException {
>         //if (!input.incrementToken()) {
>             //return false;
>         //}
>         if (finished) {
>             logger.error("Finished");
>             return false;
>         }
>         logger.error("Starting");
>         StringBuilder buffer = new StringBuilder();
>         int length = 0;
>         while (input.incrementToken()) {
>             logger.error(Integer.toString(buffer.length()));
>             logger.error(buffer.toString());
>             if (0 == length) {
>                 buffer.append(termAtt.buffer());
>                length += termAtt.length();
>             } else {
>                 buffer.append(" ").append(termAtt.buffer());
>                length += termAtt.length() + 1;
>             }
> 
>         }
> 
>         logger.error("####### Final");
>         logger.error(Integer.toString(buffer.length()));
>         logger.error(Integer.toString(length));
>         logger.error(buffer.toString());
> 
>         termAtt.setEmpty().append(buffer);
>         offsetAtt.setOffset(0, length);
>         finished = true;
>         return true;
>     }
> 
> 
> *Output for input tokens booh and good is *
> 
> SEVERE: Starting
> Sep 30, 2011 9:02:13 PM org.ctown.solr.analysis.CTConcatFilter
> incrementToken
> SEVERE: 0
> Sep 30, 2011 9:02:13 PM org.ctown.solr.analysis.CTConcatFilter
> incrementToken
> SEVERE:
> Sep 30, 2011 9:02:13 PM org.ctown.solr.analysis.CTConcatFilter
> incrementToken
> SEVERE: 14
> Sep 30, 2011 9:02:13 PM org.ctown.solr.analysis.CTConcatFilter
> incrementToken
> SEVERE: booh
> Sep 30, 2011 9:02:13 PM org.ctown.solr.analysis.CTConcatFilter
> incrementToken
> SEVERE: ####### Final
> Sep 30, 2011 9:02:13 PM org.ctown.solr.analysis.CTConcatFilter
> incrementToken
> SEVERE: 29
> Sep 30, 2011 9:02:13 PM org.ctown.solr.analysis.CTConcatFilter
> incrementToken
> SEVERE: 9
> Sep 30, 2011 9:02:13 PM org.ctown.solr.analysis.CTConcatFilter
> incrementToken
> SEVERE: booh good
> Sep 30, 2011 9:02:13 PM org.ctown.solr.analysis.CTConcatFilter
> incrementToken
> SEVERE: Finished
> 
> 
> And this is how it appears on the Solr analysis
> page (http://localhost:8983/solr/admin/analysis.jsp):
> org.ctown.solr.analysis.CTConcatFilterFactory
> {luceneMatchVersion=LUCENE_34}
> position      1
> *term text    booh#0;#0;#0;#0;#0;#0;#0;#0;#0;#0;
> good#0;#0;#0;#0;#0;#0;#0;#0;#0;#0;*
> startOffset   0
> endOffset     9
> 
> Kindly help me understand what I am doing wrong and how to fix this.
> 
> 
> 
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Writing-a-TokenConcatenateFilter-junk-characters-appearing-on-output-tp3383684p3383684.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org


