[ https://issues.apache.org/jira/browse/TIKA-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17950471#comment-17950471 ]

Hudson commented on TIKA-4345:
------------------------------

SUCCESS: Integrated in Jenkins build Tika » tika-branch_3x-jdk11 #2031 (See [https://ci-builds.apache.org/job/Tika/job/tika-branch_3x-jdk11/2031/])
TIKA-4345 -- add back configurability for injecting headers into the body of emails (legacy pre-4.x behavior) (tallison: [https://github.com/apache/tika/commit/67410849203b82d050e5bea5dfeb35d012db4bb6])
* (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/java/org/apache/tika/parser/microsoft/OutlookParserTest.java
* (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/OutlookExtractor.java
TIKA-4345 -- add back configurability for injecting headers into the body of emails (legacy pre-4.x behavior) (tallison: [https://github.com/apache/tika/commit/c7f39fa8f7c07b46ec893bc73bf1cc97eaf1ee4c])
* (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/java/org/apache/tika/parser/microsoft/OutlookParserTest.java
TIKA-4345 -- checkstyle (tallison: [https://github.com/apache/tika/commit/5067383246afd3a0974bfb0973ec23308a638476])
* (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/java/org/apache/tika/parser/microsoft/OutlookParserTest.java


> Allow body-only content extraction for msg and other email formats
> ------------------------------------------------------------------
>
>                 Key: TIKA-4345
>                 URL: https://issues.apache.org/jira/browse/TIKA-4345
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Minor
>
> At least in the OutlookExtractor, we're writing some of the headers into the
> content stream. For some use cases, it would be helpful to extract only the
> body content into the content stream.
> Looks like OutlookExtractor and maybe OutlookPSTParser are the only parsers
> that need to be modified. We're not writing the from/to etc. in the
> RFC822Parser into the content stream.
>
> I propose that this be a non-breaking/opt-in option in 3.x, and then the
> default in 4.x.
>
> In thinking about this more, I think we should get rid of injection of the
> header info into the content in msg files in 4.x. If users want it, we can
> add it back and do it correctly -- in .eml, outlook and pst. What troubles me
> about this behavior is that we currently have it only in msg. If we want to
> make it a feature, we should support it in the same way across all email
> formats.
>
> So, for 3.x, I propose that we allow users to turn this off in msg files. For
> 4.x, we just won't do it...unless someone opens a ticket.
>
> Let me know what you think/if there are any objections.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)