[jira] [Commented] (TIKA-3768) message/rfc822 does not include Headers in extracted text

Nick Burch (Jira) Sun, 05 Jun 2022 07:17:04 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-3768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17550216#comment-17550216
 ]


Nick Burch commented on TIKA-3768:
----------------------------------

I wouldn't expect to find those in the textual content after parsing; those 
fields should end up in the Metadata object instead.
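
To illustrate the split described above, here is a toy sketch in plain Java that separates an RFC 822 style message into header fields and body text. It is not Tika's RFC822Parser, which handles MIME parts, encodings and much more; the class name `HeaderBodySplit` and its `split` method are hypothetical and exist only for this example. The point is only that header fields and body are structurally separate, so a parser can route the former to a metadata map and the latter to the extracted text.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical toy class for illustration only; not part of Tika.
public class HeaderBodySplit {

    /** Holds the two outputs: header fields (the "metadata") and the body text. */
    public static final class Parsed {
        public final Map<String, String> headers = new LinkedHashMap<>();
        public String body = "";
    }

    /** Splits a raw message at the first blank line into headers and body. */
    public static Parsed split(String raw) {
        Parsed out = new Parsed();
        String normalized = raw.replace("\r\n", "\n");
        int blank = normalized.indexOf("\n\n");
        String headerBlock = blank < 0 ? normalized : normalized.substring(0, blank);
        out.body = blank < 0 ? "" : normalized.substring(blank + 2);

        String lastName = null;
        for (String line : headerBlock.split("\n")) {
            if (line.isEmpty()) {
                continue;
            }
            boolean folded = line.charAt(0) == ' ' || line.charAt(0) == '\t';
            if (folded && lastName != null) {
                // A line starting with whitespace continues the previous header value.
                out.headers.merge(lastName, line.trim(), (a, b) -> a + " " + b);
                continue;
            }
            int colon = line.indexOf(':');
            if (colon <= 0) {
                continue; // not a "Name: value" line; ignored in this toy version
            }
            lastName = line.substring(0, colon).trim();
            // A repeated header name overwrites the earlier value in this toy version.
            out.headers.put(lastName, line.substring(colon + 1).trim());
        }
        return out;
    }

    public static void main(String[] args) {
        String raw = "From: alice@example.com\n"
                + "To: bob@example.com\n"
                + "Subject: Hello\n"
                + "\n"
                + "Body text here.\n";
        Parsed p = split(raw);
        System.out.println("headers = " + p.headers);
        System.out.println("body    = " + p.body);
    }
}
```

In this sketch the Subject, From and To values are reachable only through the `headers` map, and `body` contains none of them, which mirrors the behaviour described above for Tika's Metadata object versus the extracted text.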

We have a bunch of unit tests for mail parsing which show that, for our test 
files at least, subject + from + to all come through; see 
[https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/parser/mail/RFC822ParserTest.java]

Are you able to compare your code with that in the unit test, and see any 
differences between the working test and yours? Bonus marks if you can write a 
small failing JUnit test that shows the issue with your file.

> message/rfc822 does not include Headers in extracted text
> ---------------------------------------------------------
>
>                 Key: TIKA-3768
>                 URL: https://issues.apache.org/jira/browse/TIKA-3768
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.4.0
>            Reporter: Sam Stephens
>            Priority: Major
>         Attachments: email.txt
>
>
> When running AutoDetectParser on message/rfc822 structured text documents, 
> such as the attached [^email.txt], the extracted text does not include any of 
> the headers, such as the Subject and From and To lines.
> However, these lines contain useful text I'd like to be able to extract. I'm 
> surprised it's not there, based on the include-everything bias I saw on 
> https://issues.apache.org/jira/browse/TIKA-3710.
> Interestingly, if I exclude org.apache.tika.parser.mail.RFC822Parser as a 
> parser, my debugging appears to show 
> org.apache.tika.parser.csv.TextAndCSVParser being used for parsing, and we 
> get the full text, but the returned content type is 'message/rfc822; 
> charset=windows-1252'.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
