Yes, I ended up doing essentially that. There was no need to tokenize with Lucene: I simply split the input string into a sequence of alternating "word" and "nonword" tokens based on Character.isLetter(), and then looked up the words.
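
For anyone searching the archives later, here is a minimal sketch of that approach. It is not the exact code I used: the class and method names are made up for illustration, and it assumes the dictionary is a plain Map<String, String> in which stopwords and unknown words simply have no entry, so they pass through unchanged.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class; names are illustrative only.
class PreservingTranslator {

    // Split text into alternating runs of letters ("words") and
    // non-letters ("nonwords"). Concatenating the runs restores the input.
    // Note: this works on char values, so characters outside the Basic
    // Multilingual Plane (surrogate pairs) are treated as nonword content.
    static List<String> split(String text) {
        List<String> runs = new ArrayList<>();
        int start = 0;
        for (int i = 1; i <= text.length(); i++) {
            boolean atEnd = (i == text.length());
            if (atEnd || Character.isLetter(text.charAt(i))
                    != Character.isLetter(text.charAt(i - 1))) {
                runs.add(text.substring(start, i));
                start = i;
            }
        }
        return runs;
    }

    // Replace every word that has a dictionary entry; copy everything else
    // (whitespace, punctuation, unknown words) through untouched.
    static String translate(String text, Map<String, String> dictionary) {
        StringBuilder out = new StringBuilder(text.length());
        for (String run : split(text)) {
            boolean isWord = Character.isLetter(run.charAt(0));
            out.append(isWord ? dictionary.getOrDefault(run, run) : run);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> dictionary = new HashMap<>();
        dictionary.put("hello", "bonjour");
        dictionary.put("world", "monde");
        System.out.println(translate("hello:  world the!\nhello", dictionary));
    }
}
```

Because the nonword runs are appended exactly as they were read, spacing, punctuation and line breaks in the output match the input. The lookup is case-sensitive as written; lowercasing the run before the lookup would be one way to relax that.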

Ilya

-----Original Message-----
From: Danil ŢORIN [mailto:torin...@gmail.com]
Sent: Monday, January 16, 2012 5:50 AM
To: java-user@lucene.apache.org
Subject: Re: how to preserve whitespaces etc when tokenizing stream?

Maybe you could simply use String.replace()?
Or the text actually needs to be tokenized?

On Fri, Jan 13, 2012 at 18:44, Ilya Zavorin <izavo...@caci.com> wrote:
> I am trying to perform a "translation" of sorts of a stream of text. More
> specifically, I need to tokenize the input stream, look up every term in a
> specialized dictionary and output the corresponding "translation" of the
> token. However, I also want to preserve all the original whitespaces,
> stopwords etc from the input so that the output is formatted in the same
> way as the input instead of ending up being a stream of translations. So if
> my input is
>
> <term1>: <term2> <stopword>! <term3>
> <term4>
>
> then I want the output to look like
>
> <term1'>: <term2'> <stopword>! <term3'>
> <term4'>
>
> (where <termi'> is translation of <termi>) instead of
>
> <term1'> <term2'> <term3'> <term4'>
>
> Currently I am doing the following:
>
> PatternAnalyzer pa = new PatternAnalyzer(Version.LUCENE_31,
>         PatternAnalyzer.WHITESPACE_PATTERN,
>         false,
>         WordlistLoader.getWordSet(new File(stopWordFilePath)));
> TokenStream ts = pa.tokenStream(null, in);
> CharTermAttribute charTermAttribute =
>         ts.getAttribute(CharTermAttribute.class);
>
> while (ts.incrementToken()) { // loop over tokens
>     String termIn = charTermAttribute.toString();
>     ...
> }
>
> but this, of course, loses all the whitespaces etc. How can I modify this
> to be able to re-insert them into the output? thanks much!
>
> Thanks,
> Ilya