[ 
https://issues.apache.org/jira/browse/TIKA-2471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16208769#comment-16208769
 ] 

Luis Filipe Nassif edited comment on TIKA-2471 at 10/18/17 3:47 AM:
--------------------------------------------------------------------

Hi Matthew,

If I remember correctly, some headers were not being extracted by the 
RFC822Parser after the refactoring of MboxParser, so that logic was added to 
get the missing headers back, right [~thaichat04]? I think it may be better to 
fix the RFC822Parser instead.

The windows-1252 charset was used initially to make it easier to locate 
newlines and the "From" delimiter. I think it does not corrupt content, because 
the chars are converted back to bytes and there is a one-to-one mapping between 
chars and bytes with this charset. But I think it shouldn't be added to the 
contentType metadata of the mbox container.
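
The one-to-one claim can be checked directly. The sketch below is only an 
illustration and is not taken from the Tika code base; the class and method 
names are made up. It decodes each of the 256 byte values with a charset, 
encodes the result back, and lists the values that do not survive. ISO-8859-1 
maps every byte value to a char, so it should list nothing. Code page 1252 
leaves five byte values unassigned, so the result for windows-1252 depends on 
how the JDK in use handles those values, and no output is assumed here.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Tika.
class RoundTripCheck {

    // Returns every byte value (0..255) that changes when it is decoded
    // to a String with the given charset and then encoded back to bytes.
    static List<Integer> lossyBytes(String charsetName) {
        Charset cs = Charset.forName(charsetName);
        List<Integer> lossy = new ArrayList<>();
        for (int b = 0; b < 256; b++) {
            byte[] in = {(byte) b};
            byte[] out = new String(in, cs).getBytes(cs);
            if (out.length != 1 || out[0] != in[0]) {
                lossy.add(b);
            }
        }
        return lossy;
    }

    public static void main(String[] args) {
        for (String name : new String[] {"windows-1252", "ISO-8859-1"}) {
            System.out.println(name + " lossy byte values: " + lossyBytes(name));
        }
    }
}
```

If the windows-1252 list turns out to be non-empty on a given JDK, ISO-8859-1 
would be the safer choice for this kind of byte-preserving scan.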


was (Author: lfcnassif):
Hi Matthew,

If I remember correctly, some headers were not being extracted by the 
RFC822Parser after the refactoring of MboxParser, so that logic was added to 
get the missing headers back, right [~thaichat04]? I think it may be better to 
fix the RFC822Parser instead.

The windows-1252 charset was used initially to make it easier to locate 
newlines and the "From" delimiter. I think it shouldn't be added to the 
contentType metadata of the mbox container.

> Tab-prefixed message body lines in Mbox interpreted as headers
> --------------------------------------------------------------
>
>                 Key: TIKA-2471
>                 URL: https://issues.apache.org/jira/browse/TIKA-2471
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.16
>            Reporter: Matthew Caruana Galizia
>              Labels: message, rfc822
>         Attachments: mbox
>
>
> The mbox parser code is overly optimistic. It parses the entire message 
> looking for anything that matches a header pattern, wherever it occurs in a 
> line!
> It looks to me like the parsing logic is in desperate need of a refactor. But 
> more to the point, what is the idea behind setting the headers in the 
> MboxParser if they're going to be set by the RFC822Parser in any case?
> Also, out of curiosity, why does the parser force Windows-1252 as the charset?
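
The behaviour described in the report, a header pattern matching wherever it 
occurs in a line, can be reproduced with a small sketch. The pattern below is 
hypothetical and is not the one MboxParser uses; it only shows how an 
unanchored search accepts a tab-prefixed body line, while a match anchored at 
the start of the line rejects it.

```java
import java.util.regex.Pattern;

// Hypothetical illustration; the pattern is an assumption, not Tika's.
class HeaderLineCheck {

    // A simplified "Name: value" header shape.
    static final Pattern HEADER = Pattern.compile("[A-Za-z0-9-]+:\\s*.*");

    // Unanchored: succeeds if the pattern occurs anywhere in the line.
    static boolean looseMatch(String line) {
        return HEADER.matcher(line).find();
    }

    // Anchored: succeeds only if the pattern starts at the first character.
    static boolean strictMatch(String line) {
        return HEADER.matcher(line).lookingAt();
    }

    public static void main(String[] args) {
        String bodyLine = "\tNote: see the attached file";
        System.out.println("loose:  " + looseMatch(bodyLine));
        System.out.println("strict: " + strictMatch(bodyLine));
    }
}
```

Anchoring alone would not be a complete fix, since a body line can also start 
with text that looks like a header. Stopping header detection at the first 
blank line of each message would still be needed.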



