I'm trying to build a plain-text parser for different file formats such as MS Word, Excel, PowerPoint, Rich Text Format, plain text, HTML, PDF, etc.
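For concreteness, a parser of this kind could be structured roughly as a registry that maps file extensions to per-format extractors. This is only a sketch under assumptions: `PlainTextParser`, `TextExtractor`, `register`, `withDefaults`, and `stripTags` are hypothetical names, and the two handlers shown use only the JDK. Real handlers for PDF or MS Office files would delegate to libraries such as PDFBox or POI, which is not shown here.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class PlainTextParser {

    /** Hypothetical interface: one implementation per supported file format. */
    public interface TextExtractor {
        String extract(Path file) throws IOException;
    }

    // Maps a lower-case file extension (without the dot) to its extractor.
    private final Map<String, TextExtractor> extractors = new HashMap<>();

    public void register(String extension, TextExtractor extractor) {
        extractors.put(extension.toLowerCase(Locale.ROOT), extractor);
    }

    /** Picks the extractor by file extension and returns the document's plain text. */
    public String parse(Path file) throws IOException {
        String name = file.getFileName().toString();
        int dot = name.lastIndexOf('.');
        String ext = dot < 0 ? "" : name.substring(dot + 1).toLowerCase(Locale.ROOT);
        TextExtractor extractor = extractors.get(ext);
        if (extractor == null) {
            throw new IOException("No extractor registered for: " + name);
        }
        return extractor.extract(file);
    }

    /** Crude stand-in for an HTML handler: drops tags and collapses whitespace. */
    public static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    /** Registers the two JDK-only handlers; assumes the files are UTF-8 encoded. */
    public static PlainTextParser withDefaults() {
        PlainTextParser parser = new PlainTextParser();
        parser.register("txt",
                f -> new String(Files.readAllBytes(f), StandardCharsets.UTF_8));
        parser.register("html",
                f -> stripTags(new String(Files.readAllBytes(f), StandardCharsets.UTF_8)));
        return parser;
    }
}
```

With this layout, supporting another format only means registering one more `TextExtractor`, and the string returned by `parse` is what would be handed to the indexer.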
I use the well-known libraries PDFBox, POI, and some parts of AtLeap. Now I need to support the OpenOffice formats and, more importantly, the MSG format (MS Outlook message format). Does anyone know how I can extract plain text from MSG files as simply as POI does for Office files? Is there perhaps an open source library for this, like the ones for PDF or MS Office files?

I need the plain text because the only way for me seems to be to extract all the plain text from every single document and then add it to my Lucene index. This is necessary to get the best excerpt from the highlighter.

Thx
Simon Dietschi