OT: Parsing Russian text from RTF

Bowesman Antony Thu, 15 May 2008 19:53:52 -0700

Not directly Lucene related, but I'm out of ideas and I'm not a Russian 
speaker...


I'm extracting text from RTF to pump into Lucene.  I'm using the original 
RTFEditorKit() code shown in LIA, p252 (actually, it's Nutch's RTFParser).

I have an RTF document, which starts with

---
{\rtf1\ansi\ansicpg1252\deff0\deflang1033{\fonttbl{\f0\froman\fprq2\fcharset204{\*\fname Times New Roman;}Times New Roman CYR;}{\f1\fswiss\fprq2\fcharset0 Arial;}}
{\colortbl ;\red0\green0\blue128;\red0\green0\blue0;}
\viewkind4\uc1\pard\tx360\cf1\f0\fs20\'c1\'ee\'eb\'fc\'f8\'e8\'ed\'f1\'f2\'e2\'ee
---

which should be 'Большинство', but the RTFReader translationTable always 
maps the RTF bytes to chars using Latin-1, and the correct translationTable 
is never set.  The "fcharset204" is Russian, apparently CP1251, but there's 
a lovely line in the RTFReader class:

/* TODO: per-font font encodings ( \fcharset control word ) ? */

Does anyone know if the RTF above is correct?  The only place the translation 
table is set during the parse is when the 'ansi' keyword is encountered.

Other than that, anyone have any ideas about getting the text out of the RTF 
properly?
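
To make the mismatch concrete, here is a minimal sketch of the decoding step 
I'd expect for the bytes above, assuming fcharset204 really does mean 
windows-1251 and that the JVM ships that charset.  The class and method names 
(RtfHexDecoder, charsetFor, decode) are made up for illustration and are not 
part of RTFEditorKit or Nutch, and the charset table only covers a handful of 
\fcharset values:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;

// Hypothetical helper, not part of RTFEditorKit or Nutch's RTFParser.
final class RtfHexDecoder {

    // Map an RTF \fcharsetN value to a Java charset.
    // Only a few values are listed; anything else falls back to windows-1252,
    // which matches the \ansicpg1252 in the header above.
    static Charset charsetFor(int fcharset) {
        switch (fcharset) {
            case 204: return Charset.forName("windows-1251"); // Russian
            case 238: return Charset.forName("windows-1250"); // Eastern European
            case 161: return Charset.forName("windows-1253"); // Greek
            case 162: return Charset.forName("windows-1254"); // Turkish
            default:  return Charset.forName("windows-1252"); // assumed fallback
        }
    }

    // Decode a run of RTF text containing \'hh escapes.
    // Each \'hh becomes one raw byte; plain ASCII characters pass through.
    // The collected bytes are then decoded with the font's charset.
    static String decode(String rtfRun, int fcharset) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        int i = 0;
        while (i < rtfRun.length()) {
            if (rtfRun.startsWith("\\'", i) && i + 4 <= rtfRun.length()) {
                bytes.write(Integer.parseInt(rtfRun.substring(i + 2, i + 4), 16));
                i += 4;
            } else {
                bytes.write((byte) rtfRun.charAt(i));
                i++;
            }
        }
        return new String(bytes.toByteArray(), charsetFor(fcharset));
    }

    public static void main(String[] args) {
        String run = "\\'c1\\'ee\\'eb\\'fc\\'f8\\'e8\\'ed\\'f1\\'f2\\'e2\\'ee";
        // Decoded per the font's \fcharset204
        System.out.println(decode(run, 204));
        // Decoded as if the font were \fcharset0, which is roughly what I get today
        System.out.println(decode(run, 0));
    }
}
```

This only handles the \'hh escapes in a single run; a real fix would still 
need the parser to track which font (and so which \fcharset) is active at 
each point in the document.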

Thanks
Antony

