Gregory Lepore created TIKA-4447:
------------------------------------

             Summary: eml attachement duplicate filename on extract
                 Key: TIKA-4447
                 URL: https://issues.apache.org/jira/browse/TIKA-4447
             Project: Tika
          Issue Type: Bug
    Affects Versions: 3.2.0
            Reporter: Gregory Lepore
         Attachments: 12.eml

Not sure if this is a bug or something wrong with the source files. I'm extracting and analyzing attachments from a huge set of eml files (originally in pst format). However, attachments are getting the filename doubled on extraction. For example, for the attached eml file I get:

java -jar /media/lepore/Work/tika/tika.jar --extract 12.eml
Extracting 'rtf-body.rtfrtf-body.rtf' (application/rtf) to ./cc9d8ebd-b93c-4235-b766-79b0aa841ef2-rtf-body.rtfrtf-body.rtf
Extracting '03-005 ACF GA Plan1.doc03-005 ACF GA Plan1.doc' (application/msword) to ./0220432f-6dcc-4beb-b659-66be0fe0f60f-03-005 ACF GA Plan1.doc03-005 ACF GA Plan1.doc
Extracting 'Talking Point1 1-17.docTalking Point1 1-17.doc' (application/msword) to ./24bbaeab-448e-4d47-8b6d-ee9651156f89-Talking Point1 1-17.docTalking Point1 1-17.doc

All of the extracted file names are doubled. In the eml file I see:

Content-Type: application/msword
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename*=utf-8''Talking%20Point1%201-17.doc; filename="Talking Point1 1-17.doc"

Perhaps the doubled filename here is contributing to the problem?

Extracting the files with pffexport doesn't double the filename, but ripmime has trouble, and munpack also has trouble.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
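
For reference, the Content-Disposition header quoted above carries the same name twice: once as an RFC 2231 extended parameter (filename*) and once as a plain quoted parameter (filename). The sketch below shows one way a consumer could reduce the two to a single name by preferring filename* when both are present, which is what RFC 6266 recommends for HTTP. It is only an illustration: the class and method names (DispositionFilename, parseParams, decodeExtended, pickFilename) are made up for this example, the parser assumes no ';' inside quoted values, and it is not the code Tika or its MIME parser actually uses.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Tika: picks one filename from a
// Content-Disposition value that carries both filename* and filename.
class DispositionFilename {

    // Split "attachment; a=b; c=\"d\"" into parameter name -> raw value.
    // Simplification (assumption): no ';' appears inside quoted values.
    static Map<String, String> parseParams(String header) {
        Map<String, String> params = new LinkedHashMap<>();
        String[] parts = header.split(";");
        for (int i = 1; i < parts.length; i++) { // index 0 is the disposition type
            String part = parts[i].trim();
            int eq = part.indexOf('=');
            if (eq < 0) {
                continue;
            }
            String name = part.substring(0, eq).trim().toLowerCase(Locale.ROOT);
            String value = part.substring(eq + 1).trim();
            if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                value = value.substring(1, value.length() - 1); // strip quotes
            }
            params.put(name, value);
        }
        return params;
    }

    // Decode an extended value of the form charset'language'percent-encoded.
    static String decodeExtended(String value) {
        int first = value.indexOf('\'');
        int second = value.indexOf('\'', first + 1);
        if (first < 0 || second < 0) {
            return value; // not in extended form, return unchanged
        }
        String charset = value.substring(0, first);
        if (charset.isEmpty()) {
            charset = "UTF-8"; // assumption: default when no charset is given
        }
        // URLDecoder turns '+' into a space, so protect literal plus signs.
        String encoded = value.substring(second + 1).replace("+", "%2B");
        try {
            return URLDecoder.decode(encoded, charset);
        } catch (UnsupportedEncodingException | IllegalArgumentException e) {
            return encoded; // fall back to the raw text
        }
    }

    // Prefer filename* over filename; never concatenate the two.
    static String pickFilename(String header) {
        Map<String, String> params = parseParams(header);
        if (params.containsKey("filename*")) {
            return decodeExtended(params.get("filename*"));
        }
        return params.get("filename"); // null when neither is present
    }

    public static void main(String[] args) {
        String header = "attachment; filename*=utf-8''Talking%20Point1%201-17.doc; "
                + "filename=\"Talking Point1 1-17.doc\"";
        System.out.println(pickFilename(header));
    }
}
```

Run against the header from 12.eml, this yields a single "Talking Point1 1-17.doc". The doubled names in the --extract output look like what you would get if the two parameter values were appended to each other instead of one being chosen, though that is a guess at the cause and not a confirmed diagnosis.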