August Valera created TIKA-4307:
-----------------------------------

Summary: Text in header not extracted for Microsoft Word doc file
Key: TIKA-4307
URL: https://issues.apache.org/jira/browse/TIKA-4307
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 2.9.2
Reporter: August Valera
Attachments: 560702J-2x-converted.doc, 560702J-converted.docx, 560702J-full-output.txt, 560702J.doc

We have a Microsoft Word doc file with text in the header. That header text is not extracted along with the rest of the file content, but converting the file to docx results in successful extraction. Samples are attached; the conversions were done using cloudconvert.com.

* [^560702J.doc] Original doc file, missing content
* [^560702J-converted.docx] Converted to docx file, correct output
* [^560702J-2x-converted.doc] Docx file converted back to doc, again missing content

h3. Current Behavior

doc files omit header text. docx files extract header text correctly.

h3. Expected Behavior

doc and docx files with identical header content should produce identical output.
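
One way to check the expected behavior is to compare the text extracted from the doc and docx variants and list what the doc output lacks. The following is a minimal sketch that uses only the Java standard library and assumes the text of each file has already been extracted and saved as plain text. The class name {{ExtractionDiff}}, the method {{missingLines}}, and the file arguments are hypothetical names chosen for illustration; they are not part of Tika.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper (not part of Tika): reports non-blank lines that appear
// in the expected extraction (e.g. from the docx) but not in the actual one
// (e.g. from the doc).
final class ExtractionDiff {
    private ExtractionDiff() {}

    static List<String> missingLines(String expectedText, String actualText) {
        // Collect the trimmed, non-blank lines of the actual output.
        Set<String> actual = new HashSet<>();
        for (String line : actualText.split("\\R")) {
            String trimmed = line.strip();
            if (!trimmed.isEmpty()) {
                actual.add(trimmed);
            }
        }
        // Keep each expected line that the actual output does not contain,
        // preserving the order in which it appears in the expected text.
        List<String> missing = new ArrayList<>();
        for (String line : expectedText.split("\\R")) {
            String trimmed = line.strip();
            if (!trimmed.isEmpty() && !actual.contains(trimmed)) {
                missing.add(trimmed);
            }
        }
        return missing;
    }

    // Usage: java ExtractionDiff <docx-output.txt> <doc-output.txt>
    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: ExtractionDiff <expected.txt> <actual.txt>");
            return;
        }
        String expected = Files.readString(Path.of(args[0]));
        String actual = Files.readString(Path.of(args[1]));
        for (String line : missingLines(expected, actual)) {
            System.out.println("missing: " + line);
        }
    }
}
```

The comparison is line based and ignores leading and trailing whitespace, so it only flags content that is absent altogether, such as the header text here, and not differences in ordering or spacing between the two formats.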