[jira] [Comment Edited] (TIKA-1130) .docx text extract leaves out some portions of text

Daniel Gibby (JIRA) Mon, 24 Jun 2013 07:49:14 -0700

    [ https://issues.apache.org/jira/browse/TIKA-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13692049#comment-13692049 ]


Daniel Gibby edited comment on TIKA-1130 at 6/24/13 2:47 PM:
-------------------------------------------------------------

It looks like the POI bug 
(https://issues.apache.org/bugzilla/show_bug.cgi?id=54849) has been updated to 
"Resolved Fixed". I've downloaded the svn sources of POI and Tika, but I'm not 
sure where Tika picks up the POI code. What needs to be done to test Tika 
against the updated POI code?
                
      was (Author: dangby):
    Looks like the POI bug 
(https://issues.apache.org/bugzilla/show_bug.cgi?id=54849) was updated to 
resolved fixed. I've downloaded svn sources of POI and Tika, but I'm not sure 
where the POI code gets located in Tika. What needs to be done to test the 
updated POI code?
                  
> .docx text extract leaves out some portions of text
> ---------------------------------------------------
>
>                 Key: TIKA-1130
>                 URL: https://issues.apache.org/jira/browse/TIKA-1130
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.2, 1.3
>         Environment: OpenJDK x86_64
>            Reporter: Daniel Gibby
>            Priority: Critical
>         Attachments: Resume 6.4.13.docx
>
>
> When parsing a Microsoft Word .docx 
> (application/vnd.openxmlformats-officedocument.wordprocessingml.document), 
> certain portions of text are not extracted.
> I have attached a .docx file to test against. The gray portions of text are 
> not extracted, while the darker text extracts fine.
> The document.xml part inside the .docx zip file shows that the text is all 
> there.
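
The document.xml check described above can be reproduced without Tika or POI. 
The following is a minimal sketch, not code from either project: the class 
name DocxTextDump and the method extractText are hypothetical, and it assumes 
the main part is stored as word/document.xml and that text runs use the 
conventional w: namespace prefix. It concatenates the raw contents of every 
w:t element and does not decode XML entities or add paragraph breaks.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper: dumps the text runs found in a .docx main document part.
public class DocxTextDump {
    // Matches <w:t>...</w:t> and <w:t xml:space="preserve">...</w:t>,
    // but not other elements such as <w:tab/> or <w:tbl>.
    private static final Pattern TEXT_RUN =
            Pattern.compile("<w:t(?:\\s[^>]*)?>([^<]*)</w:t>");

    static String extractText(InputStream docx) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                // Assumption: the main part lives at the conventional path.
                if (!"word/document.xml".equals(entry.getName())) {
                    continue;
                }
                // Read the whole entry into memory.
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                byte[] chunk = new byte[8192];
                int n;
                while ((n = zip.read(chunk)) != -1) {
                    buf.write(chunk, 0, n);
                }
                String xml = new String(buf.toByteArray(), StandardCharsets.UTF_8);
                // Concatenate the contents of every text run, in document order.
                StringBuilder text = new StringBuilder();
                Matcher m = TEXT_RUN.matcher(xml);
                while (m.find()) {
                    text.append(m.group(1));
                }
                return text.toString();
            }
        }
        throw new IOException("word/document.xml not found in archive");
    }

    public static void main(String[] args) throws IOException {
        // Usage: java DocxTextDump <path-to-docx>
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(extractText(in));
        }
    }
}
```

Comparing this dump with the text Tika returns for the same file should show 
which runs are being dropped by the parser.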

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
