[ https://issues.apache.org/jira/browse/TIKA-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Allison updated TIKA-4345:
------------------------------
    Description:
At least in the OutlookExtractor, we're writing some of the headers into the content stream. For some use cases, it would be helpful to extract only the body content into the content stream.

Looks like OutlookExtractor and maybe OutlookPSTParser are the only parsers that need to be modified. We're not writing the from/to etc. in the RFC822Parser into the content stream.

I propose that this be a non-breaking/opt-in option in 3.x, and then the default in 4.x.

In thinking about this more, I think we should get rid of injection of the header info into the content in msg files in 4.x. If users want it, we can add it back and do it correctly -- in .eml, outlook and pst. What troubles me about this behavior is that we currently have it only in msg. If we want to make it a feature, we should support it in the same way across all email formats.

So, for 3.x, I propose that we allow users to turn this off in msg files. For 4.x, we just won't do it...unless someone opens a ticket.

Let me know what you think/if there are any objections.

  was:
At least in the OutlookExtractor, we're writing some of the headers into the content stream. For some use cases, it would be helpful to extract only the body content into the content stream.

Looks like OutlookExtractor and maybe OutlookPSTParser are the only parsers that need to be modified. We're not writing the from/to etc. in the RFC822Parser into the content stream.

I propose that this be a non-breaking/opt-in option in 3.x, and then the default in 4.x.

In thinking about this more, I think we should get rid of injection of the header info into the content in msg files in 4.x. If users want it, we can add it back and do it correctly -- in .eml, outlook and pst. It is weird that we currently have it only in msg.

So, for 3.x, I propose that we allow users to turn this off in msg files. For 4.x, we just won't do it...unless someone opens a ticket.

Let me know what you think/if there are any objections.


> Allow body-only content extraction for msg and other email formats
> ------------------------------------------------------------------
>
>                 Key: TIKA-4345
>                 URL: https://issues.apache.org/jira/browse/TIKA-4345
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Minor
>
> At least in the OutlookExtractor, we're writing some of the headers into the
> content stream. For some use cases, it would be helpful to extract only the
> body content into the content stream.
> Looks like OutlookExtractor and maybe OutlookPSTParser are the only parsers
> that need to be modified. We're not writing the from/to etc. in the
> RFC822Parser into the content stream.
> I propose that this be a non-breaking/opt-in option in 3.x, and then the
> default in 4.x.
> In thinking about this more, I think we should get rid of injection of the
> header info into the content in msg files in 4.x. If users want it, we can
> add it back and do it correctly -- in .eml, outlook and pst. What troubles me
> about this behavior is that we currently have it only in msg. If we want to
> make it a feature, we should support it in the same way across all email
> formats.
> So, for 3.x, I propose that we allow users to turn this off in msg files. For
> 4.x, we just won't do it...unless someone opens a ticket.
> Let me know what you think/if there are any objections.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)