[ https://issues.apache.org/jira/browse/TIKA-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17897962#comment-17897962 ]
ASF GitHub Bot commented on TIKA-4345:
--------------------------------------

tballison merged PR #2042:
URL: https://github.com/apache/tika/pull/2042

> Allow body-only content extraction for msg and other email formats
> ------------------------------------------------------------------
>
>                 Key: TIKA-4345
>                 URL: https://issues.apache.org/jira/browse/TIKA-4345
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Minor
>
> At least in the OutlookExtractor, we're writing some of the headers into the
> content stream. For some use cases, it would be helpful to extract only the
> body content into the content stream.
> Looks like OutlookExtractor and maybe OutlookPSTParser are the only parsers
> that need to be modified. We're not writing the from/to etc. in the
> RFC822Parser into the content stream.
> I propose that this be a non-breaking/opt-in option in 3.x, and then the
> default in 4.x.
> In thinking about this more, I think we should get rid of injection of the
> header info into the content in msg files in 4.x. If users want it, we can
> add it back and do it correctly -- in .eml, outlook and pst. What troubles me
> about this behavior is that we currently have it only in msg. If we want to
> make it a feature, we should support it in the same way across all email
> formats.
> So, for 3.x, I propose that we allow users to turn this off in msg files. For
> 4.x, we just won't do it...unless someone opens a ticket.
> Let me know what you think/if there are any objections.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)