[jira] [Commented] (TIKA-4307) Text in header not extracted for Microsoft Word doc file

Tim Allison (Jira) Tue, 10 Sep 2024 05:29:18 -0700


    [ https://issues.apache.org/jira/browse/TIKA-4307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17880628#comment-17880628 ]


Tim Allison commented on TIKA-4307:
-----------------------------------

I asked for help from fellow POI devs: 
https://bz.apache.org/bugzilla/show_bug.cgi?id=69314

> Text in header not extracted for Microsoft Word doc file
> --------------------------------------------------------
>
>                 Key: TIKA-4307
>                 URL: https://issues.apache.org/jira/browse/TIKA-4307
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.9.2
>            Reporter: August Valera
>            Priority: Major
>         Attachments: 560702J-2x-converted.doc, 560702J-converted.docx, 
> 560702J-full-output.txt, 560702J.doc, screenshot-1.png
>
>
> We have a Microsoft Word doc file with text in the header. That header text 
> is not successfully extracted alongside the file content, but converting the 
> file to a docx file results in successful extraction.
> Samples are attached; the conversions were done using cloudconvert.com.
>  * [^560702J.doc] Original doc file, missing content
>  * [^560702J-converted.docx] Converted to docx file, correct output
>  * [^560702J-2x-converted.doc] Docx file converted back to doc, again missing 
> content
> h3. Current Behavior
> Header text is omitted when parsing doc files, but is extracted correctly from docx files.
> h3. Expected Behavior
> doc and docx files with identical header content should produce identical 
> output.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
