[
https://issues.apache.org/jira/browse/TIKA-3768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17550216#comment-17550216
]
Nick Burch commented on TIKA-3768:
----------------------------------
I wouldn't expect to find those in the textual content after parsing; those
fields should end up in the Metadata object instead.
We have a bunch of unit tests for mail parsing which show that, for our test
files at least, subject + from + to are all coming through, see
[https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/parser/mail/RFC822ParserTest.java]
Are you able to compare your code with that in the unit test, and see any
differences between the working test and yours? Bonus marks if you can write a
small failing JUnit test that shows the issue with your file.
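As a starting point for that comparison, here is a minimal sketch of how to see what lands in the Metadata object versus the body text. It assumes tika-core and tika-parsers-standard-package (2.x) are on the classpath. The class name MailMetadataDump and the helper extract are made up for illustration, and it dumps every metadata name rather than assuming which keys the mail parser sets.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

// Hypothetical helper class, not part of Tika.
public class MailMetadataDump {

    // Parses the stream with AutoDetectParser, fills the supplied Metadata
    // object, and returns the extracted body text.
    static String extract(InputStream in, Metadata metadata) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        // -1 disables the default write limit on the handler
        BodyContentHandler handler = new BodyContentHandler(-1);
        parser.parse(in, handler, metadata, new ParseContext());
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        // "email.txt" is the attachment name from the issue; pass another path as args[0]
        String path = args.length > 0 ? args[0] : "email.txt";

        Metadata metadata = new Metadata();
        String text;
        try (InputStream in = Files.newInputStream(Paths.get(path))) {
            text = extract(in, metadata);
        }

        // Header-derived values (subject, from, to, ...) would be expected here,
        // not in the body text printed further down.
        for (String name : metadata.names()) {
            System.out.println(name + " = " + String.join(", ", metadata.getValues(name)));
        }

        System.out.println("---- body text ----");
        System.out.println(text);
    }
}
```

If the subject, from and to values show up in the metadata dump but not under the body text, that matches what the unit tests expect.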
> message/rfc822 does not include Headers in extracted text
> ---------------------------------------------------------
>
> Key: TIKA-3768
> URL: https://issues.apache.org/jira/browse/TIKA-3768
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.4.0
> Reporter: Sam Stephens
> Priority: Major
> Attachments: email.txt
>
>
> When running AutoDetectParser on message/rfc822 structured text documents,
> such as the attached [^email.txt], the extracted text does not include any of
> the headers, such as the Subject and From and To lines.
> However, these lines contain useful text I'd like to be able to extract. I'm
> surprised it's not there, based on the include-everything bias I saw on
> https://issues.apache.org/jira/browse/TIKA-3710.
> Interestingly, if I exclude org.apache.tika.parser.mail.RFC822Parser as a
> parser, my debugging appears to show
> org.apache.tika.parser.csv.TextAndCSVParser being used for parsing, and we
> get the full text, but the returned content type is 'message/rfc822;
> charset=windows-1252'.
--
This message was sent by Atlassian Jira
(v8.20.7#820007)