I am trying to perform a "translation" of sorts on a stream of text. More specifically, I need to tokenize the input stream, look up every term in a specialized dictionary, and output the corresponding "translation" of each token. However, I also want to preserve all the original whitespace, stopwords, etc. from the input, so that the output is formatted the same way as the input instead of ending up as a bare stream of translations. So if my input is
    <term1>: <term2> <stopword>! <term3> <term4>

then I want the output to look like

    <term1'>: <term2'> <stopword>! <term3'> <term4'>

(where <termi'> is the translation of <termi>) instead of

    <term1'> <term2'> <term3'> <term4'>

Currently I am doing the following:

    PatternAnalyzer pa = new PatternAnalyzer(Version.LUCENE_31,
            PatternAnalyzer.WHITESPACE_PATTERN, false,
            WordlistLoader.getWordSet(new File(stopWordFilePath)));
    TokenStream ts = pa.tokenStream(null, in);
    CharTermAttribute charTermAttribute = ts.getAttribute(CharTermAttribute.class);
    while (ts.incrementToken()) { // loop over tokens
        String termIn = charTermAttribute.toString();
        ...
    }

but this, of course, loses all the whitespace, punctuation, and stopwords. How can I modify this so that I can re-insert them into the output?

Thanks,
Ilya
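
To make the desired behaviour concrete, here is a minimal sketch of one possible direction. It deliberately does not use Lucene. Tokens are found with a plain java.util.regex pattern, and the text between consecutive tokens is copied verbatim using each token's start and end offsets. The class name GapPreservingTranslator, the method translate, the \w+ token pattern, and the sample dictionary are all hypothetical, chosen only for illustration. In Lucene, OffsetAttribute (startOffset() / endOffset()) might play the role of m.start() / m.end() here, assuming the analyzer populates it and the original input is kept as a String; that variant has not been tested.

```java
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GapPreservingTranslator {
    // Hypothetical token definition: a run of word characters.
    private static final Pattern TOKEN = Pattern.compile("\\w+");

    /**
     * Replaces each token with its dictionary entry while copying everything
     * between tokens (whitespace, punctuation) unchanged. Stopwords and terms
     * missing from the dictionary are emitted as they appear in the input.
     */
    static String translate(String input,
                            Map<String, String> dictionary,
                            Set<String> stopWords) {
        StringBuilder out = new StringBuilder(input.length());
        Matcher m = TOKEN.matcher(input);
        int previousEnd = 0; // offset just past the previous token

        while (m.find()) {
            // Copy the gap between the previous token and this one verbatim.
            out.append(input, previousEnd, m.start());

            String term = m.group();
            if (stopWords.contains(term)) {
                out.append(term); // stopwords pass through untouched
            } else {
                out.append(dictionary.getOrDefault(term, term));
            }
            previousEnd = m.end();
        }

        // Copy whatever follows the last token (trailing punctuation, newline).
        out.append(input, previousEnd, input.length());
        return out.toString();
    }

    public static void main(String[] args) {
        // Sample data, made up for this illustration.
        Map<String, String> dictionary = Map.of("hello", "bonjour", "world", "monde");
        Set<String> stopWords = Set.of("the");
        System.out.print(translate("hello:  the world!\n", dictionary, stopWords));
    }
}
```

The point of the sketch is that nothing needs to be "re-inserted" if the output is built by walking the original string and substituting only the token spans. If the analyzer drops stopwords, they would simply fall inside a gap between two token offsets and be copied along with the surrounding whitespace.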