PDF and Outlook docs embedded in MS Word documents not parsed -------------------------------------------------------------
Key: TIKA-704 URL: https://issues.apache.org/jira/browse/TIKA-704 Project: Tika Issue Type: Bug Components: parser Affects Versions: 0.9 Environment: Windows 7 64-bit Reporter: Jeremy Anderson Currently there appear to be issues with embedded pdf's and outlook Msg files contained in MS Word documents. I'll attach a sample for each and my recursive parser (incase the problem lies in there). >From what I see, when these embedded objects are parsed, they're initially >identified as vnd.openxmlformats-officedocument.oleObject in the metadata's >Content-Type field. After a call to the RecurciveParsers super parse class the >Content-Types update to the following: PDF's: application/vnd.ms-works .MSG: application/x-tika-msoffice The internal AutoDetectParser is unable to properly identify these PDF's and therfore does not call the PDFParser on them. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira