[ https://issues.apache.org/jira/browse/TIKA-1713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14709353#comment-14709353 ]
Tim Allison commented on TIKA-1713:
-----------------------------------

Thank you for opening this issue and attaching a mock example file. If there is any way to mock up the full .msg file, that'd be great, too, but I understand that may not be possible.

Out of curiosity, were the embedded documents processed correctly? Or do you care about those?

> RTF parser misses text content
> ------------------------------
>
>                 Key: TIKA-1713
>                 URL: https://issues.apache.org/jira/browse/TIKA-1713
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.10
>            Reporter: Mike Cantrell
>            Assignee: Tim Allison
>         Attachments: no-text.rtf
>
>
> We have a lot of Outlook msg files that have RTF body content. Tika is not
> finding any text within these messages. It appears to be a mixture of RTF and
> HTML.
> I've extracted an example RTF body (see attachment) for use with the
> following test case:
> {code}
> ByteArrayOutputStream bytes = new ByteArrayOutputStream()
> rtfParser.parse(
>     this.class.getResourceAsStream("/problems/no-text.rtf"),
>     new EmbeddedContentHandler(new BodyContentHandler(bytes)),
>     new Metadata(), new ParseContext()
> );
> assertTrue("Document is missing required text", bytes.toByteArray().length > 0)
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)