Gregory Lepore created TIKA-4447:
------------------------------------

             Summary: eml attachment duplicate filename on extract
                 Key: TIKA-4447
                 URL: https://issues.apache.org/jira/browse/TIKA-4447
             Project: Tika
          Issue Type: Bug
    Affects Versions: 3.2.0
            Reporter: Gregory Lepore
         Attachments: 12.eml

I'm not sure whether this is a bug in Tika or a problem with the source files.
I'm extracting and analyzing attachments from a huge set of eml files
(originally in pst format), and every attachment gets its filename doubled on
extraction. For example, for the attached eml file I get:

java -jar /media/lepore/Work/tika/tika.jar --extract 12.eml
Extracting 'rtf-body.rtfrtf-body.rtf' (application/rtf) to ./cc9d8ebd-b93c-4235-b766-79b0aa841ef2-rtf-body.rtfrtf-body.rtf
Extracting '03-005 ACF GA Plan1.doc03-005 ACF GA Plan1.doc' (application/msword) to ./0220432f-6dcc-4beb-b659-66be0fe0f60f-03-005 ACF GA Plan1.doc03-005 ACF GA Plan1.doc
Extracting 'Talking Point1 1-17.docTalking Point1 1-17.doc' (application/msword) to ./24bbaeab-448e-4d47-8b6d-ee9651156f89-Talking Point1 1-17.docTalking Point1 1-17.doc
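
Until the cause is found, names like the ones above could be cleaned up after
extraction. The following is only a minimal sketch of such a stopgap, not
anything Tika provides: the class DoubledName and its collapse method are
hypothetical names, and the check assumes the name is an exact repetition of
one string. A file whose real name happens to be a repeated string (for
example "abab") would also be shortened, so it should only be run on output
known to be affected.

```java
class DoubledName {

    /**
     * Returns the first half of the name when the name is exactly one
     * string written twice; otherwise returns the name unchanged.
     */
    static String collapse(String name) {
        int n = name.length();
        if (n == 0 || n % 2 != 0) {
            return name; // an odd-length name cannot be an exact doubling
        }
        String half = name.substring(0, n / 2);
        return name.substring(n / 2).equals(half) ? half : name;
    }

    public static void main(String[] args) {
        // prints: rtf-body.rtf
        System.out.println(collapse("rtf-body.rtfrtf-body.rtf"));
        // prints: report.doc (not doubled, so left alone)
        System.out.println(collapse("report.doc"));
    }
}
```

The helper works on the embedded name reported in quotes by --extract, not on
the output path, which carries a UUID prefix in front of the name.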

All of the extracted file names are doubled. In the eml file I see:

Content-Type: application/msword
Content-Transfer-Encoding: base64
Content-Disposition: attachment; 
        filename*=utf-8''Talking%20Point1%201-17.doc;
        filename="Talking Point1 1-17.doc"

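The filename appears twice in this header, once as the encoded filename*
parameter and once as the plain filename parameter. Perhaps that is
contributing to the problem?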
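
For comparison, here is a minimal sketch of how a parser could reduce a
Content-Disposition value like the one above to a single name. It is not the
code Tika or its mail parser uses; the class ContentDispositionFilename and
its methods are hypothetical. It lets filename* win over filename when both
are present, which is the precedence RFC 6266 recommends for HTTP; applying
the same rule to mail is an assumption here. Backslash escapes inside quoted
strings and RFC 2231 continuations (filename*0, filename*1, ...) are not
handled.

```java
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

class ContentDispositionFilename {

    /** Splits the header value on ';' outside quotes into name -> value pairs. */
    static Map<String, String> parseParams(String headerValue) {
        Map<String, String> params = new LinkedHashMap<>();
        StringBuilder token = new StringBuilder();
        boolean inQuotes = false;
        int n = headerValue.length();
        for (int i = 0; i <= n; i++) {
            boolean atEnd = (i == n);
            char c = atEnd ? ';' : headerValue.charAt(i);
            if (!atEnd && c == '"') {
                inQuotes = !inQuotes;
            }
            if (atEnd || (c == ';' && !inQuotes)) {
                addParam(params, token.toString());
                token.setLength(0);
            } else {
                token.append(c);
            }
        }
        return params;
    }

    private static void addParam(Map<String, String> params, String token) {
        int eq = token.indexOf('=');
        if (eq < 0) {
            return; // the disposition type ("attachment") has no '='
        }
        String name = token.substring(0, eq).trim().toLowerCase(Locale.ROOT);
        String value = token.substring(eq + 1).trim();
        if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
            value = value.substring(1, value.length() - 1); // strip the quotes
        }
        params.put(name, value);
    }

    /** Returns one filename: decoded filename* if usable, else plain filename. */
    static String resolve(String headerValue) {
        Map<String, String> params = parseParams(headerValue);
        String extended = params.get("filename*");
        if (extended != null) {
            // Extended form: charset'language'percent-encoded-value
            String[] parts = extended.split("'", 3);
            if (parts.length == 3) {
                try {
                    Charset charset = Charset.forName(parts[0]);
                    // URLDecoder turns '+' into a space, so protect literal '+'.
                    // decode(String, Charset) needs Java 10 or later.
                    return URLDecoder.decode(parts[2].replace("+", "%2B"), charset);
                } catch (IllegalArgumentException e) {
                    // unknown charset or malformed escape: fall back to filename
                }
            }
        }
        return params.get("filename");
    }

    public static void main(String[] args) {
        String header = "attachment;\r\n"
                + "        filename*=utf-8''Talking%20Point1%201-17.doc;\r\n"
                + "        filename=\"Talking Point1 1-17.doc\"";
        // prints: Talking Point1 1-17.doc
        System.out.println(resolve(header));
    }
}
```

The point of the sketch is that the two parameters are alternatives for the
same name, so exactly one of them is returned and they are never joined.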

Extracting the files with pffexport doesn't double the filename, but ripmime
and munpack both have trouble.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
