[ https://issues.apache.org/jira/browse/TIKA-4307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17880628#comment-17880628 ]

Tim Allison commented on TIKA-4307:
-----------------------------------

I asked for help from fellow POI devs: https://bz.apache.org/bugzilla/show_bug.cgi?id=69314

> Text in header not extracted for Microsoft Word doc file
> --------------------------------------------------------
>
>                 Key: TIKA-4307
>                 URL: https://issues.apache.org/jira/browse/TIKA-4307
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.9.2
>            Reporter: August Valera
>            Priority: Major
>         Attachments: 560702J-2x-converted.doc, 560702J-converted.docx,
> 560702J-full-output.txt, 560702J.doc, screenshot-1.png
>
>
> We have a Microsoft Word doc file with text in the header. That header text
> is not successfully extracted alongside the file content, but converting the
> file to a docx file results in successful extraction.
> Samples are attached, conversion done using cloudconvert.com.
> * [^560702J.doc] Original doc file, missing content
> * [^560702J-converted.docx] Converted to docx file, correct output
> * [^560702J-2x-converted.doc] Docx file converted back to doc, again missing
> content
> h3. Current Behavior
> doc files omit header text. docx files extract header text correctly.
> h3. Expected Behavior
> doc and docx files with identical content in header should result in
> identical output



--
This message was sent by Atlassian Jira
(v8.20.10#820010)