[jira] [Commented] (TIKA-2478) MBOX import includes redundant copies of the text

Luis Filipe Nassif (JIRA) Tue, 17 Oct 2017 16:02:23 -0700

    [ https://issues.apache.org/jira/browse/TIKA-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16208554#comment-16208554 ]


Luis Filipe Nassif commented on TIKA-2478:
------------------------------------------

Although I have seen emls in the past with very different content in the text and 
html bodies, that is very rare. So I agree with extracting only one version, in the 
suggested order. Making that configurable is another option...

It also seems more natural to me to extract eml bodies inline, instead of handling 
them as embedded/attached documents, which they are not.
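
A minimal sketch of the "one version, in a configurable order" idea follows. It is 
not Tika code: BodySelector, selectBody and the map of MIME type to body text are 
hypothetical names used only for illustration, and the preferred order is left to 
the caller because the order itself is not fixed in this comment.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not part of Tika: picks a single body from the
// alternative bodies (e.g. text/plain and text/html) of one message.
final class BodySelector {

    private BodySelector() {}

    /**
     * Returns the first non-blank body, trying MIME types in the
     * caller-supplied preferred order. All other bodies are ignored,
     * so at most one copy of the text is extracted.
     */
    static Optional<String> selectBody(Map<String, String> bodiesByMimeType,
                                       List<String> preferredOrder) {
        for (String mimeType : preferredOrder) {
            String body = bodiesByMimeType.get(mimeType);
            if (body != null && !body.trim().isEmpty()) {
                return Optional.of(body);
            }
        }
        // No usable body in any of the preferred types.
        return Optional.empty();
    }
}
```

Called with an order such as ("text/html", "text/plain"), this would return the 
html body when it is present and non-blank, and fall back to the text body 
otherwise. Making the order a parameter is what would make the behavior configurable.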

> MBOX import includes redundant copies of the text
> -------------------------------------------------
>
>                 Key: TIKA-2478
>                 URL: https://issues.apache.org/jira/browse/TIKA-2478
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.16
>            Reporter: Robert Letzler
>            Priority: Minor
>
> MBOX messages often get parsed into four documents:
> a.    The mbox file - outer container "/"
> b.    The actual email--  "/embedded-1"
> c.    The utf-8 text content of the email "/embedded-1/embedded-2"
> d.    The utf-8 html content of the email  "/embedded-1/embedded-3"
> Entries c and d are redundant and distracting.  The MSG parser parses the 
> first non-null email body and then skips the rest.  Please modify MBOX to 
> not have separate "attached" documents for the html body and the text body.
> The attachment to https://issues.apache.org/jira/browse/TIKA-2471 is an 
> example of input sufficient to generate this behavior.
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
