[ 
https://issues.apache.org/jira/browse/TIKA-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17897976#comment-17897976
 ] 

Hudson commented on TIKA-4345:
------------------------------

SUCCESS: Integrated in Jenkins build Tika » tika-branch_3x-jdk11 #1885 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-branch_3x-jdk11/1885/])
TIKA-4345 (#2037) (tallison: 
[https://github.com/apache/tika/commit/3b22c9062e2f60133e3cb850a9997b4beea4c257])
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/AbstractOfficeParser.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/OfficeParser.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/rtf/TextExtractor.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/java/org/apache/tika/parser/microsoft/OutlookParserTest.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/java/org/apache/tika/parser/microsoft/pst/OutlookPSTParserTest.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/pst/PSTMailItemParser.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/OutlookExtractor.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/rtf/RTFParser.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/OfficeParserConfig.java
TIKA-4345 -- allow users to turn off the injection of headers into the content 
stream for MSG files. (tallison: 
[https://github.com/apache/tika/commit/37e72fcc94b2729cb1627264f18bc12485b47dbc])
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/java/org/apache/tika/parser/microsoft/OutlookParserTest.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/OfficeParserConfig.java
* (edit) CHANGES.txt


> Allow body-only content extraction for msg and other email formats
> ------------------------------------------------------------------
>
>                 Key: TIKA-4345
>                 URL: https://issues.apache.org/jira/browse/TIKA-4345
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Minor
>
> At least in the OutlookExtractor, we're writing some of the headers into the 
> content stream. For some use cases, it would be helpful to extract only the 
> body content into the content stream.
> Looks like OutlookExtractor and maybe OutlookPSTParser are the only parsers 
> that need to be modified. We're not writing the from/to etc in the 
> RFC822Parser into the content stream.
> I propose that this be a non-breaking/opt-in option in 3.x, and then the 
> default in 4.x.
> In thinking about this more, I think we should get rid of injection of the 
> header info into the content in msg files in 4.x. If users want it, we can 
> add it back and do it correctly -- in .eml, outlook and pst. What troubles me 
> about this behavior is that we currently have it only in msg. If we want to 
> make it a feature, we should support it in the same way across all email 
> formats.
> So, for 3.x, I propose that we allow users to turn this off in msg files. For 
> 4.x, we just won't do it...unless someone opens a ticket.
> Let me know what you think/if there are any objections.
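
For illustration only, the opt-out described in the issue above can be sketched as a boolean flag that gates whether header lines are written ahead of the body. This is a minimal, self-contained sketch and not Tika's code: the class BodyOnlySketch, the SketchConfig holder, and its setWriteHeadersInBody flag are hypothetical names invented here, and the real option lives in the OfficeParserConfig changes listed in the commits. The flag defaults to true to mirror the non-breaking, opt-in behavior proposed for 3.x.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BodyOnlySketch {

    /** Hypothetical config holder for this sketch; not Tika's OfficeParserConfig. */
    static final class SketchConfig {
        // Defaults to true so existing behavior is unchanged unless the user opts out.
        private boolean writeHeadersInBody = true;

        boolean isWriteHeadersInBody() {
            return writeHeadersInBody;
        }

        void setWriteHeadersInBody(boolean writeHeadersInBody) {
            this.writeHeadersInBody = writeHeadersInBody;
        }
    }

    /** Builds the content-stream text for one message, optionally prefixed by its headers. */
    static String extract(Map<String, String> headers, String body, SketchConfig config) {
        StringBuilder content = new StringBuilder();
        if (config.isWriteHeadersInBody()) {
            // Write each header as "Name: value" on its own line before the body.
            for (Map.Entry<String, String> header : headers.entrySet()) {
                content.append(header.getKey())
                       .append(": ")
                       .append(header.getValue())
                       .append('\n');
            }
        }
        content.append(body);
        return content.toString();
    }

    public static void main(String[] args) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("From", "alice@example.com");
        headers.put("To", "bob@example.com");

        SketchConfig config = new SketchConfig();
        // Default: headers are written into the content ahead of the body.
        System.out.println(extract(headers, "Hello Bob", config));

        // Opt out: only the body reaches the content.
        config.setWriteHeadersInBody(false);
        System.out.println(extract(headers, "Hello Bob", config));
    }
}
```

In this sketch the headers would still be available to the caller as a separate map, so turning the flag off only removes them from the content text, which matches the body-only use case the issue describes.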



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
