[ https://issues.apache.org/jira/browse/TIKA-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16850119#comment-16850119 ]
Ross Johnson commented on TIKA-2883:
------------------------------------

I know a bit about these types of files. Outlook / Exchange will often store messages as RTF-encapsulated HTML. This is a mixed representation of the text and formatting, such that a conforming RTF reader sees it as just a normal RTF file and ignores the HTML tags, while a special RTF de-encapsulating reader can still read the original HTML tags and ignore the other RTF operators. The actual body text / content is included only once and is shared between both representations.

A conforming RTF reader should not have to do anything special to get the text or to ignore the HTML tags. There is also such a thing as RTF-encapsulated plain text, which is similar to RTF-encapsulated HTML.

If Tika is not giving any text output for this file, then there is probably a bug in the RTF reader that is being used. Perhaps it is getting hung up on the various HTML control words that it doesn't know how to handle, when it should instead be ignoring them.

Source: I wrote a (non-Java) RTF de-encapsulator for text and HTML [https://github.com/mazira/rtf-stream-parser]

> Text not extracted from RTF files
> ---------------------------------
>
>                 Key: TIKA-2883
>                 URL: https://issues.apache.org/jira/browse/TIKA-2883
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>   Affects Versions: 1.20, 1.19.1, 1.21
>            Reporter: Luis Filipe Nassif
>            Priority: Major
>         Attachments: Message (5).rtf
>
>
> I have a number of RTF files (extracted from PST email bodies) whose text is
> currently not extracted. Sample file attached. [~talli...@apache.org], do you
> have any idea what is going on?

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
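
As a rough illustration of the encapsulation described in the comment above, here is a minimal Python sketch (not Tika code, and not taken from rtf-stream-parser) that checks whether an RTF file declares itself as encapsulated HTML or encapsulated plain text. It relies on the header control words \fromhtml1 and \fromtext from Microsoft's RTF encapsulation format. The function name encapsulation_kind and the 1024-byte header_limit are assumptions made for this example only.

```python
import re

# Header control words used by encapsulated RTF:
# \fromhtml1 marks encapsulated HTML, \fromtext marks encapsulated plain text.
# The negative lookahead keeps us from matching a longer control word
# or a different numeric parameter (for example \fromhtml10).
_FROMHTML = re.compile(rb"\\fromhtml1(?![0-9A-Za-z])")
_FROMTEXT = re.compile(rb"\\fromtext(?![0-9A-Za-z])")


def encapsulation_kind(rtf_bytes, header_limit=1024):
    """Return "html", "text", or None for a plain (non-encapsulated) RTF.

    Hypothetical helper: header_limit is an assumed cut-off for how far
    into the file the header control words are searched for.
    """
    header = rtf_bytes[:header_limit]
    # Anything that does not start with an RTF group is not RTF at all.
    if not header.lstrip().startswith(b"{\\rtf"):
        return None
    if _FROMHTML.search(header):
        return "html"
    if _FROMTEXT.search(header):
        return "text"
    return None
```

A reader that only wants the text does not need this check, since the body text is shared by both representations. It is mainly useful for deciding whether to hand the file to a de-encapsulator to recover the original HTML.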