[ https://issues.apache.org/jira/browse/TIKA-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16217509#comment-16217509 ]
Tim Allison edited comment on TIKA-2478 at 10/24/17 7:30 PM:
-------------------------------------------------------------

First patch. This incorporates the test file from TIKA-2471 and [~kkrugler]'s test files. Thank you!

While this change will make the behavior equivalent to the OutlookParser and how it handles multiple bodies, it will be a pretty big breaking change.

Given the complexity of this patch, and the breaking change-ness of it, I'm tempted to hold off until Tika 2.0.

Any and all feedback is welcomed. Thank you!


was (Author: talli...@mitre.org):
First patch. This incorporates the test file from TIKA-2471 and [~kkrugler]'s test files. Thakn you!

While this change will make the behavior equivalent to the OutlookParser and how it handles multiple bodies, it will be a pretty big breaking change.

Given the complexity of this patch, and the breaking change-ness of it, I'm tempted to hold off until Tika 2.0.

Any and all feedback is welcomed. Thank you!

> MBOX import includes redundant copies of the text
> -------------------------------------------------
>
>                 Key: TIKA-2478
>                 URL: https://issues.apache.org/jira/browse/TIKA-2478
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.16
>            Reporter: Robert Letzler
>            Assignee: Tim Allison
>            Priority: Minor
>         Attachments: TIKA-2478.patch, UET6KCXR5FYIEJYKUCK2AKF3FLXTRNAT.eml,
> mixed-simple, mixed-with-pdf-inline
>
>
> MBOX messages often get parsed into four documents:
> a. The mbox file - outer container "/"
> b. The actual email - "/embedded-1"
> c. The utf-8 text content of the email "/embedded-1/embedded-2"
> d. The utf-8 html content of the email "/embedded-1/embedded-3"
> Entries c and d are redundant and distracting. The MSG parser parses the
> first non-null email body and then it skips the rest. Please modify MBOX to
> not have separate "attached" documents for the html body and the text body.
> The attachment to https://issues.apache.org/jira/browse/TIKA-2471 is an
> example of input sufficient to generate this behavior.
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)