[ 
https://issues.apache.org/jira/browse/TIKA-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16850936#comment-16850936
 ] 

Tim Allison commented on TIKA-2883:
-----------------------------------

I have a local fix that works for all four issues.  I'll push that once I get a 
clean local build.

One item remains for future improvement: we're pretty much guessing that we've 
reached the end of the header based on whether we see {{par}} or other text-y 
kinds of things.

According to one RTF spec, this is what a header can look like, with ? for 
optional, obviously.
{noformat}
<header>        \rtf <charset> <deffont> \deff? <fonttbl> <filetbl>? 
<colortbl>? <stylesheet>? <listtables>? <revtbl>? <rsidtable>? <generator>?
{noformat}

The obnoxious part is that there can be stuff in between those items, and I'm 
hesitant to trust that real-world RTFs follow the spec and actually keep that 
order, etc...
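
To make the idea concrete, here is a rough sketch of an order-tolerant 
alternative to the {{par}} guess: scan the direct child groups of the outer 
{{\rtf}} group and report the offset just past the last one that looks like a 
header table, ignoring whatever sits in between. This is not the fix mentioned 
above and not Tika's parser code. The class name, the method names and the set 
of destination names are assumptions made for illustration, and the scan 
ignores binary ({{\bin}}) data.

```java
import java.util.Set;

// Hypothetical helper for illustration only; not part of Tika.
final class RtfHeaderSketch {

    // Assumed control-word names for the header tables in the grammar above.
    private static final Set<String> HEADER_GROUPS = Set.of(
            "fonttbl", "filetbl", "colortbl", "stylesheet", "listtable",
            "listoverridetable", "revtbl", "rsidtbl", "generator");

    private RtfHeaderSketch() {
    }

    /**
     * Returns the offset just past the last direct child group of the outer
     * group that is a known header table, or -1 if there is none. Order and
     * anything between the tables are deliberately ignored.
     */
    static int endOfHeaderTables(String rtf) {
        int depth = 0;
        int lastEnd = -1;
        boolean inHeaderGroup = false;
        for (int i = 0; i < rtf.length(); i++) {
            char c = rtf.charAt(i);
            if (c == '\\' && i + 1 < rtf.length()
                    && "{}\\".indexOf(rtf.charAt(i + 1)) >= 0) {
                i++; // skip an escaped brace or backslash
                continue;
            }
            if (c == '{') {
                depth++;
                if (depth == 2) {
                    // A new direct child of the outer group starts here.
                    inHeaderGroup =
                            HEADER_GROUPS.contains(destinationName(rtf, i + 1));
                }
            } else if (c == '}') {
                if (depth == 2 && inHeaderGroup) {
                    lastEnd = i + 1;
                }
                depth--;
            }
        }
        return lastEnd;
    }

    /** Reads the control word that opens a group, skipping an optional "\*". */
    static String destinationName(String rtf, int pos) {
        if (rtf.startsWith("\\*", pos)) {
            pos += 2;
        }
        if (pos >= rtf.length() || rtf.charAt(pos) != '\\') {
            return "";
        }
        int start = pos + 1;
        int end = start;
        while (end < rtf.length() && Character.isLetter(rtf.charAt(end))) {
            end++;
        }
        return rtf.substring(start, end);
    }
}
```

Everything after the returned offset would be treated as body, even when the 
tables arrive in a different order than the grammar says. A file with no 
recognizable tables gives -1, so a fallback such as the current {{par}} check 
would still be needed.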


> Text not extracted from RTF files
> ---------------------------------
>
>                 Key: TIKA-2883
>                 URL: https://issues.apache.org/jira/browse/TIKA-2883
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.20, 1.19.1, 1.21
>            Reporter: Luis Filipe Nassif
>            Assignee: Tim Allison
>            Priority: Major
>         Attachments: Message (5).rtf
>
>
> I have a number of RTF files (extracted from PST email bodies) whose text is 
> not extracted currently. Sample file attached. [~talli...@apache.org], do you 
> have any idea what is going on?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
