[ https://issues.apache.org/jira/browse/TIKA-2471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16208769#comment-16208769 ]
Luis Filipe Nassif edited comment on TIKA-2471 at 10/18/17 3:47 AM:
--------------------------------------------------------------------

Hi Matthew,

If I remember correctly, some headers were not being extracted by the RFC822Parser after the refactoring of MboxParser, so that logic was added to get the missing headers back, right [~thaichat04]? I think it may be better to fix the RFC822Parser instead.

The windows-1252 charset was used initially to make it easier to locate newlines and the "From" delimiter. I think it does not corrupt content, because the chars are converted back to bytes and there is a one-to-one mapping between chars and bytes with this charset. But I think it shouldn't be added to the contentType metadata of the mbox container.


was (Author: lfcnassif):
Hi Matthew,

If I remember correctly, some headers were not being extracted by the RFC822Parser after the refactoring of MboxParser, so that logic was added to get the missing headers back, right [~thaichat04]? I think it may be better to fix the RFC822Parser instead.

The windows-1252 charset was used initially to make it easier to locate newlines and the "From" delimiter. I think it shouldn't be added to the contentType metadata of the mbox container.

> Tab-prefixed message body lines in Mbox interpreted as headers
> --------------------------------------------------------------
>
>                 Key: TIKA-2471
>                 URL: https://issues.apache.org/jira/browse/TIKA-2471
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.16
>            Reporter: Matthew Caruana Galizia
>              Labels: message, rfc822
>         Attachments: mbox
>
>
> The mbox parser code is overly optimistic. It parses the entire message
> looking for anything that matches a header pattern, wherever it occurs in a
> line!
> It looks to me like the parsing logic is in desperate need of a refactor. But
> more to the point, what is the idea behind setting the headers in the
> MboxParser if they're going to be set by the RFC822Parser in any case?
> Also, out of curiosity, why does the parser force Windows-1252 as the charset?
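
On the charset question, the claim that decoding to chars and encoding back to bytes loses nothing can be checked directly. The following is only a minimal sketch, not code from Tika: the class and method names (CharsetRoundTrip, roundTrip, lossyBytes) are made up for illustration. It decodes each of the 256 byte values with a given charset, encodes the result back, and lists the values that do not survive. ISO-8859-1 maps every byte to exactly one char, so it should report none. windows-1252 leaves five byte values unassigned (0x81, 0x8D, 0x8F, 0x90, 0x9D), and whether those survive depends on the JVM's decoder, so the sketch measures that rather than assuming it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of Tika.
class CharsetRoundTrip {

    // Decode the bytes to a String with the given charset, then encode back.
    static byte[] roundTrip(byte[] input, Charset cs) {
        return new String(input, cs).getBytes(cs);
    }

    // Returns the byte values (0-255) that change when round-tripped.
    static List<Integer> lossyBytes(Charset cs) {
        List<Integer> lossy = new ArrayList<>();
        for (int i = 0; i < 256; i++) {
            byte[] one = { (byte) i };
            if (!Arrays.equals(one, roundTrip(one, cs))) {
                lossy.add(i);
            }
        }
        return lossy;
    }

    public static void main(String[] args) {
        System.out.println("ISO-8859-1 lossy bytes: "
                + lossyBytes(StandardCharsets.ISO_8859_1));

        // windows-1252 is not one of the charsets every JVM must provide,
        // so check for it before use.
        if (Charset.isSupported("windows-1252")) {
            System.out.println("windows-1252 lossy bytes: "
                    + lossyBytes(Charset.forName("windows-1252")));
        }
    }
}
```

If the windows-1252 list comes back non-empty on a given JVM, ISO-8859-1 would be a safer choice for the same purpose, since newlines and the "From" delimiter are plain ASCII in both charsets.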