how to preserve whitespaces etc when tokenizing stream?

Ilya Zavorin Fri, 13 Jan 2012 08:45:06 -0800

I am trying to perform a "translation" of sorts on a stream of text. More 
specifically, I need to tokenize the input stream, look up every term in a 
specialized dictionary, and output the corresponding "translation" of each 
token. However, I also want to preserve all the original whitespace, stopwords, 
etc. from the input, so that the output is formatted the same way as the input 
instead of ending up as a bare stream of translations. So if my input is




<term1>: <term2> <stopword>! <term3>

<term4>



then I want the output to look like



<term1'>: <term2'> <stopword>! <term3'>

<term4'>



(where <termi'> is the translation of <termi>) instead of



<term1'> <term2'> <term3'> <term4'>



Currently I am doing the following:



// Split on whitespace, keep the original case, and drop stopwords loaded from a file.
PatternAnalyzer pa = new PatternAnalyzer(Version.LUCENE_31,
                                         PatternAnalyzer.WHITESPACE_PATTERN,
                                         false,
                                         WordlistLoader.getWordSet(new File(stopWordFilePath)));
TokenStream ts = pa.tokenStream(null, in);
CharTermAttribute charTermAttribute = ts.getAttribute(CharTermAttribute.class);

while (ts.incrementToken()) { // loop over tokens
    String termIn = charTermAttribute.toString();
    ...
}



But this, of course, loses all the whitespace etc. How can I modify it so that 
I can re-insert them into the output?
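
To make the desired behavior concrete, here is a minimal sketch in plain Java 
that does not use Lucene at all. It finds terms with a regular expression and 
copies everything between matches verbatim, so whitespace, punctuation, and 
stopwords survive. The class name PreservingTranslator, the TERM pattern, and 
the dictionary and stopWords parameters are all hypothetical and only for 
illustration. With Lucene, I assume the token offsets (OffsetAttribute) could 
play the role of m.start() and m.end() here, but I have not verified that with 
PatternAnalyzer.

```java
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PreservingTranslator {
    // Hypothetical token definition: a term is a run of letters or digits.
    private static final Pattern TERM = Pattern.compile("[\\p{L}\\p{N}]+");

    /**
     * Replaces each term found in the dictionary with its translation and
     * copies everything else (whitespace, punctuation, stopwords, unknown
     * terms) to the output unchanged.
     */
    public static String translate(String input,
                                   Map<String, String> dictionary,
                                   Set<String> stopWords) {
        StringBuilder out = new StringBuilder();
        Matcher m = TERM.matcher(input);
        int last = 0; // end offset of the previous term
        while (m.find()) {
            // Copy the gap between the previous term and this one verbatim.
            out.append(input, last, m.start());
            String term = m.group();
            // Stopwords are never looked up; unknown terms stay as they are.
            String translated = stopWords.contains(term) ? null : dictionary.get(term);
            out.append(translated != null ? translated : term);
            last = m.end();
        }
        // Copy whatever follows the last term.
        out.append(input, last, input.length());
        return out.toString();
    }
}
```

For example, with a dictionary mapping cat, dog, bird, and fish to chat, chien, 
oiseau, and poisson, and "the" as the only stopword, the input 
"cat: dog the! bird", a blank line, and then "fish" comes back with the same 
punctuation and line breaks, the four terms translated, and "the" untouched.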


Thanks,

Ilya
