[jira] [Created] (TIKA-4307) Text in header not extracted for Microsoft Word doc file

August Valera (Jira) Mon, 09 Sep 2024 16:28:37 -0700

August Valera created TIKA-4307:
-----------------------------------

             Summary: Text in header not extracted for Microsoft Word doc file
                 Key: TIKA-4307
                 URL: https://issues.apache.org/jira/browse/TIKA-4307
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 2.9.2
            Reporter: August Valera
         Attachments: 560702J-2x-converted.doc, 560702J-converted.docx, 
560702J-full-output.txt, 560702J.doc


We have a Microsoft Word doc file with text in the header. Tika does not 
extract that header text along with the rest of the file content, but after 
converting the file to docx, the header text is extracted successfully.

Samples are attached; conversions were done using cloudconvert.com.
 * [^560702J.doc] Original doc file, missing content
 * [^560702J-converted.docx] Converted to docx file, correct output
 * [^560702J-2x-converted.doc] Docx file converted back to doc, again missing 
content

h3. Current Behavior

Extraction from doc files omits the header text. Extraction from docx files 
includes it.

h3. Expected Behavior

doc and docx files with identical header content should produce identical 
output.
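
One way to check the expectation above is to diff the two extracted texts. The following is only a minimal sketch, not part of the original report: it assumes the plain-text output for each attachment has already been saved to a file (for example with tika-app's {{--text}} option), and the script name and the helper names {{normalize}} and {{missing_lines}} are hypothetical. It lists every non-blank line that appears in the docx output but not in the doc output, ignoring whitespace differences.

```python
import sys
from pathlib import Path


def normalize(text: str) -> list[str]:
    """Collapse runs of whitespace and drop blank lines so layout differences are ignored."""
    return [" ".join(line.split()) for line in text.splitlines() if line.strip()]


def missing_lines(doc_text: str, docx_text: str) -> list[str]:
    """Return normalized lines present in the docx output but absent from the doc output."""
    doc_lines = set(normalize(doc_text))
    return [line for line in normalize(docx_text) if line not in doc_lines]


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python compare_extracts.py doc-output.txt docx-output.txt
    doc_path, docx_path = sys.argv[1], sys.argv[2]
    missing = missing_lines(
        Path(doc_path).read_text(encoding="utf-8"),
        Path(docx_path).read_text(encoding="utf-8"),
    )
    for line in missing:
        print(line)
    # A non-zero exit code signals that the doc output is missing content.
    sys.exit(1 if missing else 0)
```

If the header text is the only difference, the script should print just the header lines. An empty result would mean the two outputs match line for line after whitespace normalization.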



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
