Not directly Lucene related, but I'm out of ideas and I'm not a Russian speaker...
I'm extracting text from RTF to pump into Lucene. I'm using the original RTFEditorKit() code shown in LIA, p252 (actually, it's Nutch's RTFParser).

I have an RTF document, which starts with

---
{\rtf1\ansi\ansicpg1252\deff0\deflang1033{\fonttbl{\f0\froman\fprq2\fcharset204{\*\fname Times New Roman;}Times New Roman CYR;}{\f1\fswiss\fprq2\fcharset0 Arial;}}
{\colortbl ;\red0\green0\blue128;\red0\green0\blue0;}
\viewkind4\uc1\pard\tx360\cf1\f0\fs20\'c1\'ee\'eb\'fc\'f8\'e8\'ed\'f1\'f2\'e2\'ee
---

which should be 'Большинство', but the RTFReader translationTable always maps the RTF bytes to chars using Latin-1, and the correct translation table is never set. The "fcharset204" is Russian, apparently CP1251, but there's a lovely line in the RTFReader class:

/* TODO: per-font font encodings ( \fcharset control word ) ? */

Does anyone know if the RTF above is correct? The only place the translation table is set during the parse is when the 'ansi' keyword is encountered.

Other than that, does anyone have any ideas about getting the text out of the RTF properly?

Thanks
Antony
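
P.S. One workaround I can think of is to bypass RTFEditorKit for the escaped text and decode the \'xx sequences by hand. Below is a minimal sketch, assuming the whole run belongs to the \fcharset204 font and that this means windows-1251. The class name RtfHexDecoder and the method decodeHexEscapes are made up for illustration. It does not parse \fonttbl, strip control words, or track \f font switches, so it is not a real RTF parser.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;

// Hypothetical helper, not part of Swing, Nutch or Lucene.
final class RtfHexDecoder {

    // Decodes runs of RTF \'xx hex escapes using the given charset.
    // Everything else in the input is copied through unchanged.
    static String decodeHexEscapes(String rtfText, String charsetName) {
        Charset charset = Charset.forName(charsetName);
        StringBuilder out = new StringBuilder();
        ByteArrayOutputStream pending = new ByteArrayOutputStream();
        int i = 0;
        while (i < rtfText.length()) {
            if (isHexEscape(rtfText, i)) {
                // Collect the raw byte; decode later so a whole run is converted at once.
                pending.write(Integer.parseInt(rtfText.substring(i + 2, i + 4), 16));
                i += 4;
            } else {
                flush(pending, charset, out);
                out.append(rtfText.charAt(i));
                i++;
            }
        }
        flush(pending, charset, out);
        return out.toString();
    }

    // True if the text at position i looks like \'xx with two hex digits.
    private static boolean isHexEscape(String s, int i) {
        return i + 3 < s.length()
                && s.charAt(i) == '\\'
                && s.charAt(i + 1) == '\''
                && Character.digit(s.charAt(i + 2), 16) >= 0
                && Character.digit(s.charAt(i + 3), 16) >= 0;
    }

    // Converts any collected bytes to chars and appends them to the output.
    private static void flush(ByteArrayOutputStream pending, Charset charset, StringBuilder out) {
        if (pending.size() > 0) {
            out.append(new String(pending.toByteArray(), charset));
            pending.reset();
        }
    }

    public static void main(String[] args) {
        // The escaped run from the document above.
        String run = "\\'c1\\'ee\\'eb\\'fc\\'f8\\'e8\\'ed\\'f1\\'f2\\'e2\\'ee";
        System.out.println(decodeHexEscapes(run, "windows-1251"));
    }
}
```

Collecting the bytes first and decoding each run in one go, rather than casting byte by byte, would also let the same approach work for multi-byte code pages. A proper fix would still need to read the \fcharset value for each font from the font table and switch charsets whenever \fN changes.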