I need to implement a "quick and dirty" or "poor man's" translation of a 
foreign language document by looking up each word in a dictionary and replacing 
it with the English translation. So what I need is to tokenize the original 
foreign text into words and then access each word, look it up and get its 
translation. However, if possible, I also need to preserve "non-words", i.e. 
stopwords so that I could replicate them in the output stream without 
translating. If the latter is not possible then I just need to preserve the 
order of the original words so that their translations have the same order in 
the output.
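To make the desired behavior concrete, here is a minimal sketch of what I mean, 
written without Lucene and assuming a hypothetical Map<String, String> word 
dictionary (the class and method names are just illustrative):

    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class NaiveTranslator {

        // Hypothetical word-for-word dictionary; in practice it would be loaded from a file.
        private final Map<String, String> dictionary;

        public NaiveTranslator(Map<String, String> dictionary) {
            this.dictionary = dictionary;
        }

        // Replace each word with its dictionary translation; everything between
        // words (whitespace, punctuation, digits) is copied through unchanged.
        public String translate(String text) {
            Matcher m = Pattern.compile("\\p{L}+").matcher(text);
            StringBuilder out = new StringBuilder();
            int last = 0;
            while (m.find()) {
                out.append(text, last, m.start());               // preserved non-word span
                String word = m.group();
                out.append(dictionary.getOrDefault(word.toLowerCase(), word));
                last = m.end();
            }
            out.append(text.substring(last));                    // trailing non-word span
            return out.toString();
        }
    }

The question is whether Lucene's analysis machinery can take the place of the 
naive regex tokenization above, especially for languages where simple 
whitespace/letter splitting is not good enough.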

Can I accomplish this using Lucene components? I presume I'd have to start by 
creating an analyzer for the foreign language, but then what? How do I (i) 
tokenize, (ii) access the words in their original order, and (iii) also access 
the non-words, if possible? My current guess is sketched below.
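
My guess, which may well be wrong, is to call Analyzer.tokenStream to walk the 
tokens in document order and to use OffsetAttribute to copy the skipped 
non-word spans straight out of the original text. The field name "body", the 
choice of StandardAnalyzer, and the lookup() helper are all placeholders:

    import java.io.IOException;
    import java.io.StringReader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

    public class TokenWalkSketch {

        public static String translate(String text, Analyzer analyzer) throws IOException {
            StringBuilder out = new StringBuilder();
            int last = 0;
            // "body" is an arbitrary field name; the analysis here is not tied to any index.
            try (TokenStream ts = analyzer.tokenStream("body", new StringReader(text))) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                OffsetAttribute offsets = ts.addAttribute(OffsetAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {                      // tokens come back in document order
                    out.append(text, last, offsets.startOffset()); // copy the skipped non-word span verbatim
                    out.append(lookup(term.toString()));           // word-by-word dictionary lookup
                    last = offsets.endOffset();
                }
                ts.end();
            }
            out.append(text.substring(last));                      // whatever follows the last token
            return out.toString();
        }

        // Placeholder for the dictionary lookup.
        private static String lookup(String word) {
            return word;
        }

        public static void main(String[] args) throws IOException {
            // StandardAnalyzer stands in for a language-specific analyzer; depending on
            // the Lucene version its constructor may require a Version argument.
            System.out.println(translate("un texto de ejemplo", new StandardAnalyzer()));
        }
    }

Is that roughly the intended way to use a TokenStream outside of indexing, and 
are offsets the right way to recover the non-words?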

Thanks much


Ilya Zavorin

